Programming Language Design Sketchbook

2010-04-29

Object Construction

There are countless ways to build objects. Here are a few.

Classic Constructors

Many class-based object oriented languages have constructors that look suspiciously like methods. Unlike instance methods which operate on existing objects, though, these constructors bring new objects into existence, and generally use a special syntax (such as an operator called new) to distinguish their invocation from that of a method.

new Point("Pantages Theater", 34.1017, -118.3255, 
    "Hollywood", "Los Angeles", "CA", "us", "90028", 
    new LocalDate(1930, 6, 4), Status.ACTIVE)

If there are too many parameters, you can pass in a map (dictionary, associative array, hash), or your language might support named parameter association.

Setters

With setter methods, you first construct an object in which all of its fields are set to default values, then you customize the object by setting just those fields that differ from the defaults.

Point p = new Point();
p.setName("Pantages Theater");
p.setLatitude(34.1017);
p.setLongitude(-118.3255);
p.setNeighborhood("Hollywood");
p.setCity("Los Angeles");
p.setState("CA");
p.setCountry("us");
p.setPostalCode("90028");
p.setEstablished(new LocalDate(1930, 6, 4));
p.setStatus(Status.ACTIVE);

This approach solves the three problems above, but introduces a new problem: the object being constructed can't be immutable. How can you get around this?

Some languages allow fields to be marked "final" so that after the first set, they can't be set again.
In some languages, you can scope the setter methods so that only the parts of the code that are supposed to the construct the objects can see the setters. Since objects should be able to be constructed anywhere, this all but requires the programmer to factor out construction code into its module, which could be unwieldy. But there are ways to do it nicely in general; one way is the builder, described next.

Builders

Builder objects are common too. They have the three advantages of the setter approach (no need to supply non-default values, order doesn't matter, and reader knows what the arguments mean), while also allowing the newly constructed object to be immutable.

Point p = new PointBuilder()
    .name("Pantages Theater");
    .latitude(34.1017);
    .longitude(-118.3255);
    .neighborhood("Hollywood");
    .city("Los Angeles");
    .state("CA");    
    .country("us");
    .postalCode("90028");
    .established(new LocalDate(1930, 6, 4));
    .status(Status.ACTIVE)
    .build();

Factory Methods

A factory method is any regular method (or regular function) that creates and returns a new object. There's no special syntax here; all that is done is to call a constructor, builder, or other device inside the body of the factory method. In Java:

public class Card {
    private Suit suit;
    private Rank rank;
    private Card(Suit s; Rank r) {       // Constructor must be private
        suit = s;                        // to force all clients to use
        rank = r;                        // the factory method below
    }
    public static Card fromRankAndSuit(Rank r, Suit s) {
        return new Card(s, r);
    }
    . . .
}

Literals

In languages without classes, or when classes are not too important (say, when there is duck typing), object literals create new objects. Here's a JavaScript example. Fields are string literals, but we get to omit the quotes if the field name is a simple word that is not a reserved word:

{name: "Pantages Theater", lat: 34.1017, "long": -118.3255, 
    neighborhood: "Hollywood", "state or province": "CA",
    country: "us", "postal code": "90028", city: "Los Angeles", 
    established: new LocalDate("1930-06-04"), status: Status.ACTIVE)}

Object literals are also possible in languages with statically typed, closed classes. In Ada:

type Point is record
    X: Integer;
    Y: Integer;
end record;
. . .
var P: Point := Point'(6, 5);
var Q: Point := Point'(Y => 5, X => 6);

Construction in Prototypal Languages

Modern (ECMAScript 5-based) JavaScript has nice features for creating objects with a given prototype. I came up with this pattern:

/*
 * An immutable circle datatype.  Synopsis:
 *
 * var c = Circle.create(5, {x: 1, y: 4});
 * c.radius      => 5
 * c.center      => {x: 1, y: 4}
 * c.area()      => 25π
 * c.perimeter() => 10π
 */
var Circle = {
    prototype: {
        area: function () {return Math.PI * this.radius * this.radius;},
        perimeter: function () {return 2 * Math.PI * this.radius;}
    },

    create: function (r, c) {
        return Object.create(this.prototype, {
            center: {value: c},
            radius: {value: r}
        });
    }
};

The circles we create with Circle.create are immutable because we use JavaScript's default property descriptors for center and radius: they're not writable, not enumberable, we can't delete them, nor can we change these settings. The same isn't true for the Circle properties (prototype and create), nor for the prototype's own properties...is it worth using property descriptors on these too?

2010-04-17

Global Variables

The concept of a global variable seems pretty simple, especially to beginning programmers writing a single-file program in a block-structured language. But for languages with modules, or languages that support multithreading or multiprocessing, or with scripting languages embedded in other systems that expose "host" objects via script variables, things are not so clear.

Most programmers would warn that globals should be used sparingly, for various reasons:

Because they can be written to from anywhere, it's harder to tell what might be going on just by looking at local portions of code. Code is harder to debug.
Globals make multithreaded programming harder to reason about, and make race conditions more likely to pop up.
Parameters are almost always a more readable and understandable way to "share" data among routines.

How do we live with, or manage, global variables? What constructs can a language have to mitigate these problems?

Languages that Require Global Variables

There's an interesting distinction between languages that allow global variables and those that require them. If a language allows pieces of a program to be brought together from different physical components (such as files), but does not provide a programming language construct like a module or package, then global variables are basically required to share information. Even if you adopt a message-passing type of communication between functions in different files, without a nice module construct, those functions are referenced with global variables.

Modules

Many languages have modules for the express purpose of limiting the number of global variables and hence minimizing the number of name collisions. The simplest example is that of a math module, found in most modern languages. Instead of global variables for E, PI, sin, cos, tan, etc., these values are wrapped in a module construct (usually called Math), which is the only top-level name exposed to the rest of the code. Examples:

# Ruby-like
module Math
  def sin(x) ... end
  def cos(x) ... end
  PI = 3.141592653589793
  E = 2.718281828459045
end

// JavaScript-like
var Math = {
    sin: function (x) {...},
    cos: function (x) {...},
    PI: 3.141592653589793,
    E: 2.718281828459045
};

# Java-like
public class Math {
    private Math() {}
    public static double sin(double x) {...}
    public static double cos(double x) {...}
    public static double PI = 3.141592653589793;
    public static double E = 2.718281828459045;
}

Modules create a namespace for these values which would otherwise be global. If there is a chance of having too many modules (that is, a possibility of having names of modules collide), modules can be grouped into some higher level named collection (Java calls this a package). Whether or not the additional construct is used, one can always use a hierarchical naming convention for global entities, starting, for instance, with a (reversed) DNS name that the programmer "owns."

Multithreading

The dangers of unprotected global variables in multithreading code (e.g., race conditions, lost updates) are so well known we don't need to explain them here. If you are stuck with a global, shared, variable for communication between threads, your language might provide some mechanism for atomic updating. For example:

atomic var balance = 0;

AtomicInteger balance = new AtomicInteger(0);

Internet Explorer Events

In most event-driven systems, an event object is passed to event handlers. If you are using JavaScript, and are writing event handlers to function on any browser except Internet Explorer, this will be the case. You write code like:

document.body.onclick = function (e) {
    alert("Clicked at (" + e.clientX + "," + e.clientY + ")");
}

With Internet Explorer, the event object is not passed to your handler; instead, the most recently fired event is accessed by a global variable (no kidding!) called event. IE gets away with this because client-side JavaScript is single-threaded: all events are queued and are handled sequentially. To make your code work on multiple browsers you can use the following idiom:

document.body.onclick = function (e) {
    if (!e) e = event;
    alert("Clicked at (" + e.clientX + "," + e.clientY + ")");
}

Sigils

Because local variables are usually better than global variables, one interesting design choice is to make locals the default and make globals ugly. Ruby does exactly this: global variables begin with a $ character. That's a sigil — a character used within a variable name to indicate its type, category, or (in this case) scope.

Example:

$x = 10
def f
  x = 3
  puts x
  puts $x
end
def g
  puts $x
end
def h
  puts x
end
f        # writes 3 then 10
g        # writes 10
puts $x  # writes 10
h        # raises a NameError

Another example:

x = 5
def f
  puts x
end
f        # raises a NameError

Hiding in Closures and Anonymous Functions

TODO

2010-03-13

Lambdas

Lambda-expressions, or anonymous functions, are common in many languages, but are either missing or approximated in others.

Simple Expressions

There are many ways to write lambda expressions when the body is a single expression:

λ x . x + 1
{x -> x + 1}                 // Curly braces are common
[x -> x + 1]                 // Not so common, looks like an array
(x -> x + 1)                 // Let the -> operator imply function type
fn x => x + 1                // ML
lambda (x) x + 1
(lambda (x) (+ x 1))
sub {$0 + 1}                 // Perl
{$0 + 1}
{_ + 1}                      // Scala
λ x y . 2 * x + y
{x y -> 2 * x + y}
fn x y => 2 * x + y
fn (x,y) => 2 * x + y
lambda (x, y) 2 * x + y
sub {2 * $0 + $1}
{2 * $0 + $1}

Some of these require a bit more lookahead to parse than others.

Complex Function Bodies

If your language has a sequencing expression, like C's comma expression, where e1, e2, e3 means to evaluate each expression in order and return the value of the last expression encountered, then you can use the sequencing expression in any of the forms above. Operator precedence becomes the only real issue, but parentheses can make it clear exactly what the function is.

{ p, i -> k = p * 0.45359237, m = 0.0254, k / (m * m) }

Let-expressions work well here, too

{ p, i -> let k = p * 0.45359237 and m = 0.0254 in k / (m * m) end }

Lambdas can be arbitrarily complex, with return statements. Some languages allow you to omit the return statement if the expression to return is the last one in the sequence.

TODO

Recursion

To arrange for an anonymous function to be called recursively, it needs to have a name or the language should reserve a special keyword to refer to the currently executing function (watch for scoping rules here).

{[f] n -> n <= 1 ? 1 : f(n - 1) * n}

{n -> n <= 1 ? 1 : CALLEE(n - 1) * n}

Allowing recursion in lambda expressions is far better than requiring the applicative Y combinator.

Closures

Lambdas often refer to variables in outer scopes. Within the lambda expression, these outside variables are called free variables.

function incrementer(int start, int delta) {
    var current = start - delta;
    return function () {return current += delta;}
}

Calling Lambda Expressions

Calling, or applying a function, is normally done with juxtaposition, though again precedence rules will often come into play.

(x -> x + 1)7
(x -> x + 1)(7)
[x -> x + 1](7 * y - 2)
[x -> x + 1] 7 * y - 2         // application or multiplication first? 
fn (x, y) => 2 * x + y (7, y)  // too confusing 
(fn (x, y) => 2 * x + y)(7, y)

Currying

Sketches:

[x -> [y -> 2 * x + y]]
[x -> y -> 2 * x + y]
[x y -> 2 * x + y]
fn x => fn y => 2 * x + y
fn x y => 2 * x + y

Fake Lambdas

TODO

Delegates

TODO

2010-03-03

Prototypes

There are two primary ways that languages allow you to create a set of objects with the same (or similar) structure and behavior. In the classical approach, you define classes and then create objects from constructors associated with the class; in this case, objects are instances of the class. In the prototypal approach, you create a prototype object, and then instantiate new objects based on the prototype. Generally, a language is either classical (e.g., Java, Ruby, C#), or prototypal (e.g, JavaScript, Self, Io, Lua), though some languages let you construct objects both ways (e.g., ActionScript, Falcon).

What might prototype-based instantiations look like?

Simple Differential Inheritance

Perhaps the purest approach to a prototypal mechanism is one in which every object has an implicit link to another object, called its prototype. Whenever we look up a property in an object and it is not present, we go look in its prototype. If the property isn't present in the prototype, we search the prototype's prototype, and so on. (Writing a property is different, however; if the property doesn't exist in the object it is created right there.) Such a language might have an operator called beget which might be used like this:

var protoCircle = {x: 0, y: 0, radius: 1, color: "black"};
var c1 = beget protoCircle;
c1.x = 4;
c1.color = "green";

Here we have c1.x == 4, c1.y == 0, c1.radius == 1, and c1.color == "green".

We can create more circles:

var c2 = beget protoCircle;
c2.x = 4;
c2.radius = 15;
c3 = beget protoCircle;

which gives us this:

If we change any of the fields in the prototype, this change is picked up at run time by all the derived objects that haven't overridden those fields. Ditto for adding new fields to the prototype. To some people, this is convenient and flexible; to others it is ugly and error-prone. It's all those things, of course.

The prototype is also a great place to store function properties:

protoCircle.area = function () {return π * (this.radius ** 2);}

Here we've assumed we have a "this" operator which refers to the object through which the function was called (not necessarily the one along the prototype chain which ultimately contained the function).

Direct Cloning

Another approach is to define a prototype with default field values then clone it to make new objects. Cloning is different from simple differential inheritance because with cloning the new object automatically gets copy of all of the fields in the prototype, even for those that you wanted to keep the default values in the prototype. A clone operator might look like this:

var protoCircle = {x: 0, y: 0, radius: 1, color: "black"};
var c1 = clone protoCircle;
c1.x = 4;
c1.radius = 8;
c1.color = "green";

Cloning means c1 is going to get all four fields:

Cloning via user-defined functions

If a language does not have a built-in clone operator, or you just don't like the look of creating an object and then doing multiple assignments to override defaults, you can do the cloning and field assignments in a function. Default fields can be nicely managed with default parameters, in languages that have them:

function newCircle(x = 0, y = 0, r = 1, c = "black") {
    return {x: x, y: y, radius: r, color: c};
}
var c1 = newCircle(x => 4, r => 8, color => "green");
var c2 = newCircle(color => "yellow", r => 5);
var c3 = newCircle(_, 5, _, "blue");

or with a short circuit disjunction if you have that but no default parameters:

function newCircle(x, y, r, c) {
    return {x: x || 0, y: y || 0, radius: r || 1, color: c || "black"};
}
var c1 = newCircle(x => 4, r => 8, color => "green");

With cloning, we have the problem of where to put the objects' methods. How about global variables?

function circleArea(c) {return π * c.radius ** 2;}
function circlePerimeter(c) {return 2 * π * c.radius;}

That's awful! It pollutes the global namespace. What about putting them inside the objects?

function newCircle(x = 0, y = 0, r = 1, c = "black") {
    return {
        x: x, y: y, radius: r, color: c,
        area: function () {return π * this.radius ** 2;}
        perimeter: function () {return 2 * π * this.radius;}
    };
}

That's no good either — every time a circle object is created, two additional function objects are created for its area and perimeter. If we create thousands of circles we end up with thousands of identical area functions and thousands of perimeter functions sucking up memory. We should make sure we have only one copy of each (without defining global variables to access them). We want that nice little implicit link to the prototype back.

JavaScript Constructors

In JavaScript, prefixing a function call with new calls the function in a context in which the this expression refers to a newly constructed object, and, unless a return statement explicitly returns some other object, returns this newly constructed object.

function Circle(x, y, r, c) {
    this.x = x || 0;
    this.y = y || 0;
    this.radius = r || 1;
    this.color = c || "black";
}

It's also the case that JavaScript is prototypal, and weirdly so: every function is itself an object, with a property called prototype, whose value is the object that will be used as the prototype of every object created with that function. As usual, the prototype is the place where you drop in shared properties, like functions. Here's an example:

function Circle(x, y, r, c) {
    this.x = x || 0;
    this.y = y || 0;
    this.radius = r || 1;
    this.color = c || "black";
}
Circle.prototype.area = function () {return Math.PI * this.radius * this.radius;}
Circle.prototype.perimiter = function () {return Math.PI * 2 * this.radius;}
var c1 = new Circle(3, 8, 4, "green");
var c2 - new Circle(1, 7);
var c3 = new Circle();

Better Prototypal Patterns for JavaScript

It's pretty clear that JavaScript's new operator and the associated semantics for construction were designed to make the langusge look more classical. Douglas Crockford characterizes this beautifully by saying that "JavaScript is conflicted about its prototypal nature." He shows how to make JavaScript implement simple differential inheritance:

if (typeof Object.beget !== 'function') {
    Object.beget = function(o) {
        var F = function () {};
        F.prototype = o;
        return new F();
    };
}

Our running circle example can now be written like this in JavaScript:

var protoCircle = {x: 0, y: 0, radius: 1, color: "black"};
var c1 = Object.beget(protoCircle);
c1.x = 4;
c1.color = "green";

One of the nice things about using beget is we can hide away JavaScript's crazy new operator and never have to use it again. This operator is fairly confusing and error-prone. For example, when you call functions that are supposed to be called with new but aren't, the code ends up creating or overwriting global variables. Object.beget is a nicer approach.

ECMAScript 5 Improvements to JavaScript

Current JavaScript versions (as of mid-2010) are based on the 3rd Edition of ECMAScript, known as ECMAScript 3. The newly standardized ECMAScript 5 adds many properties to the Object object, including create, which is like the beget example above. In new JavaScript:

var protoCircle = {x: 0, y: 0, radius: 1, color: "black"};
var c1 = Object.create(protoCircle, {
    x: {value: 4}, 
    color: {value: "green"}
});

Much better.

2010-02-27

Object-Specific Methods

Sometimes a programmer likes to attach a method to a single object, instead of attaching it to all objects of a class. How can this be done?

We'll make three dogs, Sparky, Spot, and Spike. All dogs bark, but only Spike can bite.

Classless Languages

In JavaScript, just add function properties directly to the object:

var Dog = function (name) {this.name = name;};
Dog.prototype.bark = function () {return "woof";};
var d1 = new Dog("Sparky");
var d2 = new Dog("Spot");
var d3 = new Dog("Spike");
d3.bite = function () {return "GOTCHA";};

It is hard to imagine improving on that.

Singleton Classes

In Ruby, you add the method to the object's singleton class:

class Dog
  def bark(); "woof"; end
end
d1 = Dog.new("Sparky");
d2 = Dog.new("Spot");
d3 = Dog.new("Spike");
class <<d3
  def bite(); "GOTCHA"; end
end

The use of "<<" isn't as readable as it could be, at least for English speakers, so might there be other ways to do this for classes languages?

Object.add_method(d3, "bite", {"GOTCHA"});

d3.add_method("bite", {"GOTCHA"});

def d3.bite(); "GOTCHA"; end;

Regardless of the syntax, with a class-based language, each object will have an internal reference not only to its class, but to its bundle of object-specific methods.

Subclasses

In a language where classes are closed, you "extend" classes by making new subclasses. This seems a little uglier in our contrived example. In Java:

class Dog {
    private String name;
    public Dog(String name) {this.name = name;}
    public String bark() {return "woof";}
}
class BitingDog extends Dog {
    public BitingDog(String name) {super(name);}
    public String bite() {return "GOTCHA";}
}
Dog d1 = new Dog("Sparky");
Dog d2 = new Dog("Spot");
BitingDog d3 = new BitingDog("Spike");

This may be clunky, but it does make a lot more static analysis possible.

2010-02-11

Bracketing

At some point in the history of programming languages, the idea of code as lists of statements with jumps or gotos gave way to the idea of nested (hierarchical) declarations and control structures, as in:

ast

So programs are really trees, but we still need a string representation for storage and transmission. How can we capture trees in text? There are two main approaches. The first is to just give a direct representation of the syntax tree. The other is to be creative with syntax, making something that a human would like to read.

Here are several sketches of representations of the above tree.

S-Expressions

The usual direct representation of trees in strings is that of "S-expressions" — parenthesized forms such as (A B C D) where A is the root and B, C, and D are A's (ordered) children. B, C, and D can be primitive values or trees.

(function f (params x y)
    (block 
        (var b 3)
        (if (< x (* a (+ b 7)))
            (while found
                (block
                    (for i (range 2 5) (print i))
                    (if (== a 2) (break))))
            (= p (cos 5)))
        (return (/ a 5))))

Languages like LISP, Scheme, and Clojure use S-expressions (though with a few little extensions here and there).

In the example above, some of the objects aren't actually evaluated, so the distinction between symbols and values sometimes comes up. Occasionally this is made explicit in the syntax, for example:

(function 'f (params 'x 'y)
    (block 
        (var 'b 3)
        (if (< x (* a (+ b 7)))
            (while found
                (block
                    (for 'i (range 2 5) (print i))
                    (if (== a 2) (break))))
            (= p (cos 5)))
        (return (/ a 5))))

Instead of the quote character for symbols, the colon or ampersand (or other character) could be used.

XML Style

The tree can be given directly in XML, too. This is probably too verbose, not only because we have some markup characters, but because we have to invent elements for leaf nodes.

<function name="f" params="x,y">
  <block>
    <var id="b"><intlit value="3"/></var>
    <if>
      <less>
        <ref id="x"/>
        <times>
          <ref id="a"/>
          <plus><ref id="b"/><intlit value="7"></plus>
        </times>
      </less> 
      <while>
        <ref id="found"/>
        <block>
          <for var="i">
            <range><intlit value="2"/><intlit value="5"/></range>
            <print><ref id="i"/></print>
          </for>
          <if>
            <eq><ref id="a"/><intlit value="2"/></eq>
            <break/>
          </if>
        </block>
      </while>
      <assign>
        <ref id="p"/>
        <cos><intlit value="5"/></cos>
      </assign>
    </if>
    <return>5</return>
  </block>
</function>

Reader exercise: Could this have been done without the ref and intlit elements?

JSON Style

Another form of direct tree representation is an encoding in JSON. This gets pretty verbose as well because in JSON the properties are unordered, so naming them seems to be the most reasonable approach. And again, we have to name everything:

{type: "function", name: "f", params: ["x", "y"], body: {
    type: "block", statements: [
        {type: "var", id: "b", init: 3},
        {type: "if", 
            condition: {type: "<", left: "x", right: {
                type: "*", 
                left: "a", 
                right: {type: "+", left: "b", right: 7}
            }}, 
            truePart: {type: "while",
                condition: "found", body: {type: "block", statements: [
                    {type: "for", var: "i", 
                        range: {type: "range", low: 2, high: 5}, 
                        body: {type: "print", args: ["i"]}
                    },
                    {type: "if",
                        condition: {type: "==", left: "a", right: 2},
                        truePart: {type: "break"}
                    }
                ]}
            }, 
            falsePart: {type: "assign", 
                target: "p", 
                source: {type: "cos", args: [5]}
            }
        },
        {type: "return", value: 5}
    ]
}}

Indentation

Okay, let's move away from direct representations and use some real syntax. By real syntax we mean cleverly arranging symbols in a pleasing and readable way but still having it be possible to unambiguously recapture the tree structure.

function f (x, y)
    var b = 3
    if x < a * (b + 7)
        while found
            for i in 2..5
                print i
            if a == 2
                break
    else
        p = cos(5)
    return a / 5

Curly Braces

The use of curly braces to define subtrees is, for some reason, very popular.

function f (x) {
    var b = 3;
    if (x < a * (b + 7)) {
        while (found) {
            for (i in 2..5) {
                print i;
            }
            if (a == 2) {
                break;
            }
        }
    } else {
        p = cos(5);
    }
    return a / 5;
}

Terminating Keywords

Rather than marking the beginning and end of a block, as is done with the curly brace style and its equivalents, we can get away with just marking the end. This can be done simply with the word end:

function f (x, y)
    var b = 3
    if x < a * (b + 7)
        while found
            for i in 2..5
                print i
            end
            if a == 2
                break
            end
        end
    else
        p = cos(5)
    end
    return a / 5
end

or by spelling the initial word backwards (yes, many languages do this):

function f (x, y)
    var b = 3
    if x < a * (b + 7)
        while found
            for i in 2..5
                print i
            rof
            if a == 2
                break
            fi
        elihw
    else
        p = cos(5)
    fi
    return a / 5
noitcnuf

Postfix Style

Here's a sketch using the postfix style. Here it's definitely important, I think, to distinguish symbols from variables that need to be evaluated. I used square brackets to bundle code, as should be obvious:

[
    'x' param 'y' param
    3 'b' var
    x a b 7 + * <
    [
        found
        [
            'i' 2 5 [i print] for
            2 a == [break] [] if
        ]
    while]
    [5 cos p assign]
    if
    a 5 / return
] 'f' function

Functional Style, with 'let' and Infix Operators

Here's a style that borrows from Standard ML. Block scope is introduced with the let...in...end expression, and expression sequences use semicolons as separators. Expression sequences are allowed in the body of a let, or wherever a regular expression can go, provided the sequence is enclosed in parentheses.

function f (x, y) =
    let 
        b = 3 
    in
        if x < a * (b + 7) then
            while found do (
                for i in 2..5
                    print i
                ;
                if a == 2 then
                    break
                else
                    nil
            )
        else
            p = cos(5)
        ;
        a / 5
    end

2010-02-08

Else ifs

You may have heard of the so-called dangling else problem. The text

if e1 then if e2 then s1 else s2

can reasonably mean either of two possible things:

When mixed with other statements without any obvious end, things can get crazy:

if e1 then while e2 do if e3 then s1 else s2

The dangling-else problem arises in languages with ambiguous grammars, such as the Pascal-like:

Stmt      = Assignment | IfStmt | WhileStmt | Block | ...
IfStmt    = "if" Exp "then" Stmt ("else" Stmt)?
WhileStmt = "while" Exp "do" Stmt
Block     = "begin" Stmt* "end"

or the C-like:

Stmt      = Assignment | IfStmt | WhileStmt | Block | ...
IfStmt    = "if" "(" Exp ")" Stmt ("else" Stmt)?
WhileStmt = "while" "(" Exp ")" Stmt
Block     = "{" Stmt* "}"

There are many ways to deal with this problem, but most lead to rather unsatisfactory ways to handle the general multi-way conditional. Is there a design that avoids dangling else problems and forces programmers to write nice multiway conditional statements?

Resolve with a semantic rule

In Pascal and C, the syntax is ambiguous, but a semantic rule says that each else must match the closest if, regardless of indentation. This can be terribly confusing to beginners; the code below on the left doesn't mean what novices might think:

(* Without blocks, 'else'
tied to second 'if' *)
if cond1 then
    if cond2 then
        stmt1
else (* Sorry! Goes w/ cond2! *)
    stmt2

(* MUST use block to tie
'else' to first 'if' *)
if cond1 then begin
    if cond2 then
        stmt1
end else (* Now it goes w/ cond1. *)
    stmt2

One objection to this approach is that purists will probably hate how this solution relies on a semantic rule to handle something so inherently structural. A second objection is that multiway conditionals don’t really exist—the syntax can only describe deeply nested individual conditionals. So programmers end up playing games with their indentation to simulate multiway branching:

if cond1 then
    stmt1
else if cond2 then
    stmt2
else if cond3 then
    stmt3
else if cond4 then
    stmt4
else
    stmt5

if cond1 then begin
    stmts1
end else if cond2 then begin
    stmts2
end else if cond3 then begin
    stmts3
end else if cond4 then begin
    stmts4
end else begin
    stmts5
end

This is ugly—but it could be a lot worse: programmers often mix up their conditional arms by sometimes using blocks and sometimes not, and get the ugly code they deserve. The worst part of the “semantic rule solution” dangling else syntax is that there seem to be hundreds of different ways to format multiway if statements. Take a look at some old Pascal textbooks if you want to be grossed out.

Complicate the grammar to remove the ambiguity

It is possible, with a lot of work, to force all elses to be paired with the nearest then. Call an open statement one with a sub-statement that dangles at the end, and a closed statement one that is not open. Then:

Stmt           = OpenStmt | ClosedStmt
OpenStmt       = IfThenStmt | IfThenElseStmt | WhileStmt | ForStmt
ClosedStmt     = Assignment | Call | BreakStmt | ReturnStmt
NoShortIfStmt  = ClosedStmt | NoShortIfIfThenElse
               | NoShortIfWhile | NoShortIfFor
IfThenStmt     = "if" Exp "then" Stmt
IfThenElseStmt = "if" Exp "then" NoShortIfStmt "else" Stmt
NoShortIfIfThenElse = 
    "if" EXP "then" NoShortIfStmt "else" NoShortIfStmt
NoShortIfWhile = "while" Exp "do" NoShortIfStmt
NoShortIfFor   = "for" Var "in" Range "do" NoShortIfStmt
etc.

This grammar is no longer ambiguous, and every else is automatically connected to the nearest if. But it does nothing to prevent the confusion arising from misindented code we saw above.

Require the else part

If there always has to be an else part, there is never any ambiguity, but you’ll have funny looking code when your else part is missing: you'll need some empty statement or a special null keyword.

if cond1 then
    if cond2 then
        stmt1
    else
        null
else
    stmt2

if cond1 then
    if cond2 then
        stmt1
    else
        stmt2
else
    null

Not pretty. It works, though, even if the while statement (and other compound statements) aren’t fully bracketed, for example:

if e1 then while e2 do if e3 then s1 else null else s2

and

if e1 then while e2 do if e3 then s1 else s2 else null

Require bracketing

Ultimately, languages that have compound statements where the bodies are “trailing statements” leave too many formatting choices and look ugly and unbalanced to many people. Defining a language’s syntax to require bracketed compound statements is a Good Thing for that reason, and as a bonus it removes the dangling else problem completely. The idea is that a block should not be a kind of statement, and that all compound statements use blocks for bodies, never simple statements! This grammar fragment:

Stmt      = Assignment | IfStmt | WhileStmt | ...
IfStmt    = "if" "(" Exp ")" Block ("else" Block)?
WhileStmt = "while" "(" Exp ")" Block
Block     = "{" Stmt* "}"

yields code like:

if (cond1) {
    if (cond2) {
        stmts1
    } else {
        stmts2
    }
}

if (cond1) {
    if (cond2) {
        stmts1
    }
} else {
    stmts2
}

Bracketing can also be done with just a terminating end instead of curly braces or begin-end pairs:

Stmt      = Assignment | IfStmt | WhileStmt | ...
IfStmt    = "if" Exp "then" Stmt+ ("else" STMT+)? end
WhileStmt = "while" Exp "do" Stmt+ "end"

yielding rather clean code like this:

if cond1 then
    if cond2 then
        stmts1
    else
        stmts2
    end
end

if cond1 then
    if cond2 then
        stmts1
    end
else
    stmts2
end

But, wait—if we really require bracketing, won’t that make multiway conditionals ugly? Either you start indenting too much or you get a bunch of }s (or ends) at the end. Like this, right?

if cond1 {
    stmts1
} else {
    if cond2 {
        stmts2
    } else {
        if cond3 {
            stmts3
        } else {
            if cond4 {
                stmts4
            } else {
                stmts5
            }
        }
    }
}

if cond1 {
    stmts1
} else { if cond2 {
    stmts2
} else { if cond3 {
    stmts3
} else { if cond4 {
    stmts4
} else {
    stmts5
}}}}

Make indentation matter

In Python, there is no dangling else problem since the indentation makes things clear:

if cond1:
    if cond2:
        stmts1
else
   stmts2

if cond1:
    if cond2:
        stmts1
    else
        stmts2

But, wouldn't required indentation lead to funny looking code in multiway conditionals?

if cond1:
    stmts1
else:
    if cond2:
        stmts2
    else
        if cond3:
            stmts3
        else
            if cond4:
                stmts4
            else
                stmts5

While the code above is legal Python, real Python programmers use the next solution.

Introduce a new keyword to simplify required bracketing or indentation

In most languages where bracketing (or indentation) is required, the multiway conditional is described syntactically as a single if-statement, usually with the help of a special keyword (called elsif in Ada, Ruby, and Perl; elif in bash and Python; and elseif in PHP. We have both a curly-brace style:

IfStmt =  "if" "(" Exp ")" Block
          ("elsif" "(" Exp ")" Block)*
          ("else" Block)?
Block  =  "{" STMT* "}"

and terminating-end style:

IfStmt = "if" EXP "then" Stmt +
          ("elsif"  Exp "then" Stmt+)*
          ("else" Stmt+)?
          "end"

Code looks like this:

# Ruby
if cond1
  stmts1
elsif cond2
  stmts2
elsif cond3
  stmts3
elsif cond4
  stmts4
else
  stmts5
end

# Python
if cond1:
    stmts1
elif cond2:
    stmts2
elif cond3:
    stmts3
elif cond4:
    stmts4
else
    stmts5

// PHP
if (cond1) {
    stmts1
} elseif (cond2) {
    stmts2
} elseif (cond3) {
    stmts3
} elseif (cond4) {
    stmts4
} else {
    stmts5
}

Lisp’s COND

Because Lisp is basically written in abstract syntax trees, you won’t ever have a dangling else, and the multiway conditional is already handled by COND:

(COND
  (condition1 block1)
  (condition2 block2)
  (condition3 block3)
  (condition4 block4)
  (T block5))

Interestingly, Erlang uses the same idea, though with a terminating-end syntax. There are no explicit elses; Erlang’s if is inherently multiway:

if
  condition1 -> body1;
  condition2 -> body2;
  condition3 -> body3;
  condition4 -> body4;
end.

Require bracketing but accept lookahead

There is a way to require bracketing without resorting to special words like elsif or elif and without ending up with a whole mess of terminators at the end. The solution is amazingly simple:

IfStmt = "if" "(" Exp ")" Block
         ("else" "if" "(" Exp ")" Block)*
         ('else' Block)?
Block  = "{" Stmt* '}'

I haven't seen this before. Why is this not popular? The only thing I can think of is that top-down parsers will need a two-token lookahead when encountering an else. Why should this be a big deal? Just peek ahead to see if the next token is an if (or a {).
How can we do this for a terminating-end syntax? This doesn't work very well:

IfStmt = "if" Exp "then" Stmt+
         ("else" "if" Exp "then" Stmt+)*
         ("else" Stmt+)?
         "end"

because the amount of required lookahead is infinite. What about rejecting 'if' statements in the final else-part?

Stmt   = IfStmt | NonIfStmt
IfStmt = "if" Exp "then" Stmt+
         ("else" "if" Exp "then" Stmt+)*
         ("else" NonIfStmt Stmt*)?
         "end"

There we go! As long as NonIfStmt cannot be empty and cannot start with if, we can parse top-down with a lookahead of 2.

Novel syntactic forms

If the objection to elsif and related words is just that they are made up, we could give the if-statement a make over, using more reasonable words, or symbols, even::

when cond1 do
    stmts1
or when cond2 do
    stmts2
or when cond3 do
    stmts3
or when cond4 do
    stmts
or else do
    stmts
end

try cond1 do
    stmts1
or cond2 do
    stmts2
or cond3 do
    stmts3
or cond4 do
    stmts
otherwise
    stmts
end

try [ cond1 ] =>
    stmts1
[ cond2 ] =>
    stmts2
[ cond3 ] =>
    stmts3
[ cond4 ] =>
    stmts
[] =>
    stmts
end