Programming Language Design Sketchbook

Type Coercion

2011-03-05T14:08:00.000-08:00

A type conversion is an operation that takes a value of one type and returns a value of another type such that the two values are somehow equivalent. Examples include:

float(5) ⇒ 5.0 // integer to float
string('s') ⇒ "s" // character to string
codepoint('A') ⇒ 0x41 // character to integer
A (Java) FileInputStream converted to a ByteArrayInputStream by first loading into memory

If no conversion operator explicitly appears in the code, the conversion is called a type coercion. Like all implicit operations, coercions can be convenient (less source code to type and read), but also dangerous (as things happen without the programmer's explicit authorization).

How can languages make conversions and coercions available to programmers? Coercions are usually dealt with in one of three ways:

Allow all corecions
Disallow all coercions
Allow only certain coercions, such as subtypes to supertypes

No Coercions Ever

Languages like Ada and ML have a very simple rule: no coercions ever. One cannot even use an integer where a floating point value is expected, without explicitly casting the integer to a float.

Implicit Upcasting

Some languages, such as Java, allow casting in only one way, from subtypes to supertypes. This is called upcasting. You can use an integer where a float is expected, and the integer value will be coerced to the right floating point value. You can't go the other way automatically, though.

Implicit Downcasting

Coercions from a supertype to a subtype (downcasting) are inherently dangerous because information is being lost. In C:

int x = 8.2;            // x gets 8
int y = 8.5;            // does y get 8 or 9?

Directly assigning a floating point value to an integer variable in its declaration is unrealistic; here's something that is more likely to occur:

void f(char* s, int n) { ... }
.
.
.
f("some text", z);   // z is a float, coerced to a int within f

Again, this may not be too bad in practice; z will be truncated or rounded and the resulting value will be passed to the parameter n. Unless z is a global variable and is assigned the value of n within f, there's no problem, right? People would never do such a thing, right?

Never say never! SAP has a product called Business Objects Data Integrator (BODI), with an embedded database programming language. In the Data Integrator Reference Guide, on page 453, they describe this function:

ifthenelse: Allows conditional logic in expressions.
Syntax
ifthenelse(condition, true_branch, false_branch)
Return value
true_branch or false_branch
Returns one of the values provided, based on the result of condition. The data type of the return value is the data type of the expression in true_branch. If the data type of false_branch is not convertible to the data type of true_branch, Data Integrator produces an error at validation. If the data types are convertible but don’t match, a warning appears at validation.

"The type of the return value is the datatype of the expression in true_branch"? That's not exactly intuitive. Combined with the fact that BODI (or the underlying database it may use) does implicit coercions from floats to integers, users are opened up to the following disaster:

P.latitude = ifthenelse(P.latitude is null, 0, P.latitude)

This is an attempt to clean up a null value by setting it to 0. But assuming the latitude is a regular floating point value, this assignment destroys the value, turning 43.2, say, into 43. If latitudes and longitudes are both "adjusted" this way, places in good neighborhoods end up in bad neighborhoods or in the middle of a river or ocean.

Coercions and Type Systems

Why would BODI even create an ifthenelse operator that would coerce an argument? This is a classic case of mixing static typing with weak typing. Note:

Static Typing means that all expressions must be fully typed at compile time, while dynamic typing means that types are resolved at run time.
Strong Typing means that operators applied to arguments of the wrong type generate errors, while weak typing means that the values will be coerced (wherever possible) to make the operation valid.

Static typing goes well with strong typing (ML, Ada), while dynamic typing can be combined with either strong (Ruby) or weak (Perl, JavaScript) typing, without too many surprises.

How should BODI's ifthenelse operator have been defined? Using static and weak typing together is not really the worst element at play here; it's the implicit downcasting of P.latitude to an integer. Why did they do this? Perhaps the designers of this language thought it would be "too complicated" to define the return type as "the most general type of true_branch and false_branch" and thought it "too restrictive" to force the two types to match exactly. They could have at least detected the fact that the false_branch had a more general type then the true_branch and generated and error there. Perhaps they thought that "too confusing." But wait, taking the type of true_branch as the result type is asymmetric and confusing on its own! How is that intuitive?

Note that with dynamic typing this problem never would have arisen since the ifthenelse expression would have produced either its second or third argument and hence taken on the type of the value returned, whatever that was. A dynamically typed language would never have even considered a coercion in that case at all! And a statically typed language with strong typing would have rejected the expression outright.

Static + weak is the problem. Language designers would be wise to avoid this combination.

Dictionaries

2010-12-31T00:04:00.000-08:00

Nearly every non-trivial application makes use of dictionaries, a.k.a. maps, associative arrays, hashes, lookup tables. What are some useful and convenient forms for these objects?

Types of Dictionaries

A look at the Java Collections and Concurrency APIs shows that there are quite a few kinds of dictionaries out there. Among them (note that these types are not mutually exclusive) are:

HashMap
SortedMap
TreeMap
LinkedHashMap
ConcurrentMap
NavigableMap
SkipListMap
IdentityHashMap
EnumMap
WeakHashMap

Literal Forms

If you've only coded in Java, C, Fortran, Pascal, Ada, or similar languages, you're missing out on the fun of writing dictionary literals. Here is a sampling of syntaxes:

{"CA":"Sacramento", "HI":"Honolulu", "OR": "Salem", "MA":"Boston"}
{CA:"Sacramento", HI:"Honolulu", OR: "Salem", MA:"Boston"}
{"CA" => "Sacramento", "HI" => "Honolulu", "OR" => "Salem", "MA" =>"Boston"}
{CA => "Sacramento", HI => "Honolulu", OR => "Salem", MA => "Boston"}
{:CA => "Sacramento", :HI => "Honolulu", :OR => "Salem", :MA => "Boston"}

The colon as a separator is probably the most popular; it appears in well-known data interchange formats like JSON and YAML. The => is called a hashrocket, because dictionaries are sometimes called hashes. (They shouldn't be called hashes in general; that term should only be used when you know that internally the dictionary uses a hash table for storage. Often you'll do fine with a search tree or a plain old array.)

Coping without Literals

Some languages do not allow literal forms for dictionaries. What can you do? In a dyanmically typed language, some kind of constructor taking a string with delimiters works pretty well:

def dictionary (s)
  d = Dictionary.new;
  pairs = s.split("|")
  for pair in pairs
    key, value = pair.split(",")
    d.put(key, value)
  end
  return d
end

states = dictionary("CA,Sacramento|HI,Honolulu|OR,Salem|MA,Boston");

Works for this one case, but in general the delimiters should be parameters...

Java programmers can combine static initializers with anonymous subclasses to approximate literals:

Map<String, String> states = new HashMap<String, String>() {
    {
        put("CA", "Sacramento");
        put("HI", "Honolulu");
        put("OR", "Salem");
        put("MA", "Boston");
    }
};

Testing Key Equality

TODO

Static Typing

TODO

Multimaps

TODO

Invocation

2010-11-30T17:17:00.000-08:00

The number of possibile ways to construct a subroutine (function, procedure, continuation, thread, task, process, etc.) invocation are enormous. There are dozens of known ways to express an invocation syntactically, and probably a dozen known semantics for argument passing. This brief article lists many current approaches, with examples, and offers some new styles.

Syntactic Invocation Patterns

There are dozens of ways to form an invocation, syntactically

Prefix, parentheses required after the routine name
- f(x, y, z)
- f()
- f(x)
- f(x, g(y, z))
- f(x, g(y), z)
- "+"(x, y)
- op+(x, y)
- operator+(x, y)

Semantics of Invocation

Semantics of Argument Passing

Generators

2010-10-31T01:09:00.000-07:00

A generator is a entity that you can call at any time to request the next value in some sequence. For simplicity, this post will focus on a particular generator: one for Fibonacci numbers.

Generators as objects

In an object-oriented language, one can create a specific generator with a property holding state and a method to return the next value. In JavaScript:

var fibonacciGenerator = { 
    a: 0, 
    b: 1, 
    next: function () { 
        var result = this.b; 
        this.b += this.a; 
        return this.a = result; 
    } 
};

Normally, you want to protect the generator state from tampering. One way to do this is to use a module that supports private variables. In Ruby:

Similarly, in a classical (as opposed to prototypal) OO language, you can make a class for your generator. Then different instances can be constructed, possibly with different starting points or different rules. Again in Ruby:

Implementing Generators with Closures

Generators in C

Language Support for Generators

Anonymous Functions

2010-10-30T23:02:00.000-07:00

A function is something you invoke, passing zero or more inputs (called arguments) and getting back zero or more outputs (called results). Functions generally have parameters (which match up with the arguments they accept), and a body, which contains the code that the function executes. What isn't necessary is that the function have a name. A function without a name is called — not surprisingly — an anonymous function.

How can we make anonymous functions easy to read and write?

Preliminaries

A few things to consider when designing support for anonymous functions:

Should the function body be a single expression or a block with a return statement or statements?
What kind of syntactic device should be used to delimit the function?
Are functions first-class values (as in Scheme, Haskell, or JavaScript) or are simple "blocks" or "continuations" sufficient?

Terse Designs

When the function body is a simple expression, there are a number of very terse ways to denote a function, many of which are used in existing languages. Here are some sketches for a function whose sole input is (an integer) x, and whose result is x/2 if x is even and 3*x+1 otherwise:

[x -> if x mod 2 = 0 then x div 2 else 3 * x + 1]
{x → x % 2 === 0 ? x / 2 : 3*x + 1}
λ x . (x / 2 if x % 2 == 0 else 3 * x + 1)
(LAMBDA (x) (IF (= (% X 2) 0) (/ X 2) (+ (* 3 X) 1)))
fn x => if x mod 2 = 0 then x / 2 else 3 * x + 1
{|x| x % 2 === 0 ? x / 2 : 3*x + 1}
function from x to x % 2 === 0 ? x / 2 : 3*x + 1 end
x to x % 2 === 0 ? x / 2 : 3*x + 1
fun (x) returning x % 2 === 0 ? x / 2 : 3*x + 1 end fun

For those forms which do not have special outer brackets to indicate a function expression, regular parentheses can be used to delineate them when they appear in the middle of a larger expression.

Verbose Designs

Function expressions that have complex bodies, such as those with several statements, including return statements can simply be written as function declarations without a name. See the JavaScript example below.

Uses of Anonymous Functions

There are two main uses of anonymous functions:

Arguments to functions like map, filter, and reduce.
To act as a module, in which local variables play the role of private variables. An example in JavaScript, in which a variable is shared between two exported global functions:
```
(function () {
    var x = 0;
    f = function () {return x -= 5;}
    g = function () {return x = 2 * Math.abs(x);}
}());
```

Values and No Values

2010-09-30T23:14:00.000-07:00

Consider something as simple as recording a property for an object, say, a person's supervisor. What kind of possibilities exist? Here are some:

I have a supervisor and her name is Alice.
I definitely have no supervisor.
I may or may not have a supervisor; I really don't know.
I do know whether I have a supervisor, but I don't care to make this information public; in other words, it's none of your business.

How can we capture these cases in a programming language?

Strings Only

For languages with unsophisticated type systems (say, strings or symbols only), you might try:

var supervisor1 = "Alice";
var supervisor2 = "None";
var supervisor3 = "Unknown";
var supervisor4 = "Private Info";

The obvious problem here is that people can have any name at all, even "None", "Nobody", "N/A", "Robert'); drop table students;--", or even "". You'd have to express in comments, and add logic to your application, that certain strings really aren't names. This isn't what we'd call a clean solution. What we want is a solution in which the values representing no (or an unknown) supervisor belong to a type or types other than plain string.

Disjoint Sum Types

ML and related languages feature a very clean syntax for defining flexible datatypes:

datatype Person = None | Unknown | WillNotSay | Actual of string;
val supervisor1 = Actual "Alice";
val supervisor2 = None;
val supervisor3 = Unknown;
val supervisor4 = WillNotSay;

This works very nicely: the type of None is Person, not string. In fact, the type of each supervisor variable above is (inferred to be) Person. While this is a nice statically-typed solution, a programmer would need to add the special constants None, Unknown, and WillNotSay to all types for which this kind of information is relevant. You can do this "once" with a polymorphic type:

datatype 'a info = None | Unknown | WillNotSay | Actual of 'a;
val supervisor1 = Actual "Alice";
val supervisor2: string info = None;
val supervisor3: string info = Unknown;
val supervisor4: string info = WillNotSay;

Here we've given explicit types to supervisor2, supervisor3, and supervisor4, since the None, Unknown, and WillNotSay constructors are polymorphic.

JavaScript's null and undefined

In a dynamically typed language like JavaScript we often have values in distinct types that represent nothingness or lack of knowlege. In JavaScript, for example:

null is the sole member of the Null type and represents the certainty of having no real value.
undefined is the sole member of the Undefined type and represents the lack of knowledge about its real value.

JavaScript doesn't natively distinguish between cases 3 and 4 — lack of knowledge and the refusal to share that knowledge. It could be argued that isn't really a common case anyway. Now you could, if you really wanted to, make this distinction somewhat like this:

var alice = {name: "Alice"; supervisor: null};
var bob = {name: "Bob"; supervisor: alice};
var eve = (name: "Eve"; supervisor: undefined};
var mallory = {name: "Mallory"}

This uses the lack of a supervisor property of Mallory to say she makes no claim to even having a supervisor, and isn't telling you if she has one at all. This approach is a little ugly in practice since, in JavaScript anyway, evaluating mallory.supervisor produces undefined! You'd have to dig into the object and examine its properties in order to pick up on the difference.

Marker Objects

Another approach that works well in a dynamically-typed language is to create the special values yourself, as simple, plain old objects. In JavaScript:

var NONE = {};
var UNKNOWN = {};
var WILL_NOT_SAY = {};

It should be fairly easy to figure out how to use these values.

This approach doesn't quite work as well in a statically-typed language. In Java, for example, marker objects would be implemented like this:

public static final Object NONE = new Object();
public static final Object UNKNOWN = new Object();
public static final Object WILL_NOT_SAY = new Object();

But, this means, of course, properties such as supervisor will have to be given the Java type Object, which goes against the whole point of using a statically typed language. ML's disjoint sum types are better for static languages.

Time, Dates, and Calendars

2010-08-22T11:02:00.000-07:00

Few programming languages have direct support for times and dates, leaving such support to libraries. This is fine. However, many languages (including Java and JavaScript) have time and date support in their standard libraries which is very poor. Java's Calendar and Date classes are particularly pathetic; JavaScript's is lame but at least fully admits its deficiencies, promising nothing. Fortunately programmers can turn to third-party libraries (including the excellent Joda-Time for Java).

Here I want to convey the basic issues in representing temporal data and operations in programming, sketch out ways to make better libraries, and look at ways to include time and date support within a language.

Basic Definitions

Many programmers have an insufficient understanding of time and dates. Here is a short, and in many cases oversimplified, overview of the main concepts:

Time

The one-dimensional continuum of experiences.

Instant

A point in time. Because time is one-dimensional, an instant can be given with a single real number, once we choose an origin and define a unit. The origin (instant 0) of the one-dimensional space is called the epoch. Common units for the description and measurement of time include the second and the millisecond.

Duration

A quantity of time, for example "17343.3422 milliseconds." Note that instants are points but durations are vectors. An instant plus a duration is an instant; An instant minus an instant is a duration.

Interval

The notion of elapsed time between two particular instants. Given an interval, you can compute its duration, but not vice versa. For example, the interval between today at noon and tomorrow at noon different from the interval between yesterday noon and today noon, even though the durations of each interval are the same.

DateTime

A description of an instant in human-friendly terms, generally, but not necessarily, based on aspects of the Earth, Sun, and Moon. DateTimes are often written with many components, e.g., 2010-08-22T17:29:41-07:00 has year, month, day, hour, minute, second, and offset components, respectively.

Calendar

A framework for describing datetimes. Also called a chronology. The chronologies natively supported by Joda-Time are:

Gregorian
Julian
GJ
ISO
Buddhist
Coptic
Ethiopic

Era

A large division of dates within a calendar. For example, the GJ calendar contains two eras: BCE and CE (also known as BC and AD).

Partial

A fragment of a datetime, leaving parts unspecified. For example: "August 1", "14:40". These are useful to describe recurring events, such as birthdays. They generally do not map to any particular instant or interval, but rather to many.

TimeZone

An optional component of a date time that allows people to have similar associations between the time portion of a datetime and the position of the sun relative to their location on Earth. Usually represented as an offset (duration), which can vary according to local law (e.g., Daylight Saving).

Period

Similar to a duration, but expressed with human-friendly terms like months, weeks, and days. A period does not have an exact number of milliseconds; for example "2 months" can have a different number of days, depending on the time of year; and even "1 day" can have 22, 23, or 24 hours when daylight time is taken into account.

A Time Library

A good time library should maintain types for each of the above concepts, as well as a rich set of parsing and formatting routines for dates, conversions between types, and operators. Joda-Time qualifies.

The ability to construct new calendars (chronologies) is also desirable. What would such a capability look like?

TODO

Native Language Support for Time

Because the concepts of time of dates are complex, language-level support for all of the concepts introduced above can become quite cumbersome. How many reserved words or special operators can a language, especially a general purpose one, absorb? How many literal forms should be added?

TODO

Default Parameters

2010-07-22T18:38:00.000-07:00

Subroutine calls can be simplified when one doesn't have to provide arguments for every parameter.

Specifying defaults in the signature

If you're lucky, your language might allow your signature to specify defaults:

function f(x = 3, y = 5, z = 10) {
    ...
}

f();            // same as f(3, 5, 10);
f(44);          // same as f(44, 5, 10);
f(32, 1);       // same as f(32, 1, 10);
f(17, 15, 22);

Other forms for specifying defaults:

(DEFUN f ((x 3) (y 5) (z 10)) ...)
def f(x := 3, y := 5, z := 10) ... end

Supplying defaults in the subroutine body

Some languages don't support a syntax in which defaults can be supplied in a signature, but they do let calls be made with fewer arguments than parameters. Parameters for missing arguments automatically receive a default value like nil or undefined. Whatever the value, these languages almost always treat it as falsy, allowing the use of an ||= assignment, as in:

function f(x, y, z) {
    x ||= 3;
    y ||= 5;
    z ||= 10;
    ...
}

The problem here is that other values which you might want to pass in might be treated as falsy by a language (such as 0 or the empty string), so this approach will not work in general. The idea can be implemented if the language has an operator to determine whether a value is "defined":

function f(x, y, z) {
    x = defined x ? x : 3;
    y = defined y ? y : 5;
    z = defined z ? z : 10;
    ...
}

function f(x, y, z) {
    x = 3 if not defined x;
    y = 5 if not defined x;
    z = 10 if not defined x;
    ...
}

function f(x, y, z) {
    if (!defined(x)) {
        x = 3;
    }
    if (!defined(y)) {
        y = 5;
    }
    if (!defined(z)) {
        z = 10;
    }
    ...
}

These forms aren't as pretty because of the logic required in the body as opposed to the signature.

Overloading

Some languages allow neither defaults to be specified nor calls with too few arguments. In this case you can simulate defaults with overloading. In Java:

public void f(int x, int y, int z) {...}
public void f(int x, int y) {f(x, y, 10);}
public void f(int x) {f(x, 5, 10);}
public void f() {f(3, 5, 10);}

REST

2010-06-28T16:40:00.000-07:00

REST, or representational state transfer, is a set of architectural principles used in the design of the world wide web, that have proven the key to the web's efficiency and scalability. What are these principles, and how can they be used in the design of programming languages?

Architectural Constraints

REST was described by Roy Fielding in his dissertation. He called out the following set of architectural constraints:

Client-Server — separates UI from data storage
Stateless Server — improves reliability and scalability
Client Cache — reduces some network traffic
Uniform Interface — decouples implementations from the services they provide
Layered System — means that each component only be concerned with those just below or just above it
Code-on-Demand — allows client functionality to be extended by downloading applets or scripts

Principles of REST

A RESTful system is characterized by

Addressable Resources
- Every resource has a (unique) identifier (e.g., URI)
- Example: http://www.yourcompany.com/products/2443
- NOT http://www.yourcompany.com/getProduct?id=2443
- Resources are separate from their representations
Representation-Orientation
- Requests for resources return representations
- The representation contains the state of the resource
- The client can update or delete resources through their representations
Uniform, Constrained Interface
- Small, fixed number of verbs
- Example: In HTTP, GET, PUT, POST, DELETE, HEAD, OPTIONS, TRACE only!
- An example of a non-constrained interface is an RPC-based protocol (e.g., SOAP, CORBA)
Self-Descriptive Messages
- Messages include all the information necessary to be processed
- Example: Internet Media Types
Stateless Server
- With no shared context on the server, client requests are independent
- Servers can be simpler, "more" multithreaded, easier to monitor, reliable, scalable, etc.
Cacheability
- Responses indicate whether the client can cache the data or not
- Obviously can cut down on network traffic if the client knows it doesn't have to ask for the same data again
HATEOAS
- Hypermedia as the engine of application state
- Representations should include links to related resources

Can we apply any of these principles to programming language design?

System-wide unique identifiers

The idea of identifiers having wide scope, whether unique across a system or unique across the entire world (such as URIs) exists in many forms: UUIDs, MAC addresses, public IP addresses, etc. In programming languages, Java package names are encouraged to be globally unique by using a (reversed) DNS name over which you have control.

Resource/Representation Separation

A limited, fixed, set of verbs

Self-descriptive content

Statelessness

Cachability

Linkage of Representations

Eval

2010-06-22T21:36:00.000-07:00

A common feature in many scripting languages is an eval function. Pass it a string representing source code, and the function will compile it and then interpret it. This function is called eval if the source code string represents an expression that produces a value, and might be called exec if it does not.

Eval is controversial and much-maligned because it is (1) so often abused, (2) so often used when much better alternatives exist, (3) is slow, and (4) can lead to disaster if misused. It is not rare to hear the claim "eval is evil." So the questions are: should a language allow eval and if so, how should this feature be designed?

Problems with eval

Things to watch out for:

Eval is slow because a compiler has to be launched to lex and parse the code string, prior to evaluating it.
Eval is lame because programmers may sometimes use eval without thinking. While it would be rare for anyone to blatantly pass a fixed string to eval, such as
```
eval("x = 3;");
```
one can usually get away with creating an anonymous function.
Eval is a security hole. Since the only reason for using eval is to run code that is supplied at run time, it's possible that this code may come from an untrusted or malicious source. Allowing just anyone to run code on your own machine is crazy.

When is eval not evil?

If you are going to use eval, the string must be completely sanitized. You need to check it for infinite loops, assignments, or calls with side-effects that might destroy the integrity of your application. For example, here is a JavaScript application that accepts an arithmetic expression from a user and evaluates it, first checking to make sure the input consists only of digits, parentheses, and the four basic arithmetic operators:

var s = prompt("Enter a numeric formula");
if (/[^\d()+*/-]/.test(s)) {
    alert("I don't trust that input");
} else {
    alert(eval(s));
}

Eval and Language Design

Either a language will allow evaluation of code strings or it will not. If it does, we can provide that functionality through a function or an operator. Because it is dangerous, it is definitely a candidate construct for being required to appear inside a "warning" construct, or similarly, disallowed from strict modes. Examples:

UNSAFE module Calcuator {
    ....
    // use eval here
    ....
}

use module UNSAFE;
application Calcuator {
    ....
    // use UNSAFE.eval here
    ....
}

Eval as a function

Generally, we see eval as a global function, or a member of a module entitled something like Kernel, System, or perhaps better, UNSAFE.

Eval as an operator

Another way to call attention to the use of eval is to simply make it a unary operator on strings. Perhaps:

var s = prompt("Enter an expression");
alert( `` s );

Alternatively, we might see the string to be evaled enclosed in angle or other brackets.

Disallowing Eval

Clearly we could imagine a language without an eval function or operator. C is one such language.

Forced Sanitization

An interesting new idea is for the eval function to take a second parameter which is (1) a block or anonymous function that performs sanitization, or (2) a regex which will be applied to the string so that the string will only be applied if the regex pattern matches. Example:

var s = prompt("Enter a numeric formula");
alert(eval(s, /[^\d()+*/-]/));

Throwing an exception on not matching would likely be the best course of action here.

Regular Expressions

2010-06-06T02:42:00.000-07:00

A regular expression (or regex, regexp, r.e.) is a pattern that matches a string or a portion of a string. They are used for validation, search, and find/replace. You'll find them in Java, JavaScript, Ruby, Python, Perl, and dozens of other popular languages. All languages agree on a common notation core, e.g. square brackets for character classes, ? for zero-or-one, * for zero or more, + for one-or-more, parentheses for grouping and capturing, and so on.

There's room for some notational innovations here, though. We'll cover a little of what's already in use and sketch some new ideas, too.

Examples from the Common Notation

Virtually all languages will interpret these the same way:

hello
gray|grey
gr(a|e)y
gr[ae]y
colou?r
go*gle
go+gle
g(oo)+gle
z{3}
z{3,6}
z{3,}
[Bb]rainf\*\*k
\d{5}(-\d{4})?
1\d{10}
[2-9]|[12]\d|3[0-6]
Hello\nworld
b..b
\d+(\.\d\d)?
^dog
dog$
^dog$
sh[^i]t
\d+(\.\d+([Ee][+-]?\d+)?)?
https?://(?:www\.)citysearch\.com/profile/[^/]+/(\d+)/?

Metacharacters and Escapes

Most characters stand for themsleves in a pattern. The ones that don't are called metacharacters. Most languages have 14 metacharacters (some more, some less). The usual ones are:

    |    ?    *    +    .    ^    $
    (    )    [    ]    {    }    \

If you want to use a metacharacter as a regular character, you have to escape it, e.g. \+, \[, \\, etc. The backslash also introduces dozens of other simplified expressions:

...
... TODO ...
...

Defining Patterns

How do you write a regex, such as \d+(\.\d+)? in program text? How about in a string literal?

'\d+(\.\d+)?'

Maybe.... The thing is most languages use \ to escape within a string literal. Perl and Ruby don't make the backslash special if the string literal has single quotes, but they do for double quotes, like most languages. In that case, you have to write:

"\\d+(\\.\\d+)?"

And if you need to write a regex that has to match a backslash character, the regex is \\, which looks like this in a string literal:

"\\\\"

But a regex is not a string: it has to be compiled into a little program that matches:

var r = re.compile("\\d+(\\.\\d+)?")
r = "\\d+(\\.\\d+)?".toregex
Pattern p = Pattern.compile("\\d+(\\.\\d+)?");

But since regexes are so important, and that backslash issue gets so annoying, we'd like a language to have a special syntax for describing regexes—something that automatically compiles once, and frees us from doubly escaping backslashes. Common notations include Ruby's %r
and slash delimiters:

%r{\d+(\.\d+)?}
/\d+(\.\d+)?/

The problem with the slash approach is that you have to escape any / characters you want in the regex itself. The %r approach has so such trouble; the language's parser can figure out which is the terminating } easily.

Matchers

To use a regex you match it against some text. This allows you to

Validate that the text conforms to a desired pattern
Find (and extract) the parts of the text that do conform
Replace matching portions of the text with something else

So these matches are stateful ... TODO ... Once you have a (compiled, immutable) pattern, you use it by creating a (stateful) matcher.

Character Classes

Quantifiers

Groups and Capturing

Backreferences

Boundaries

Lookaround

Modifiers

List Comprehensions

2010-05-20T09:08:00.000-07:00

A list comprehension is a compact syntactic form for a list (or array) that is built from applying an operation to certain elements of an existing list (or array). For example:

[7*x-3 | x in 1..n | x mod 3 == 1]

means the list of all values made by computing 7 * x - 3, for each x in 1 through n where x mod 3 is 1. If n is 10, for example, the generated list is [4, 25, 46, 67]. The three parts of the comprehension are called the output (expression), the input (range) and the filter (predicate).

Common uses of comprehensions

[7*x-3 | x in 1..n | x mod 3 == 1]  // operate on (map) and filter a range 
[7*x-3 | x in a | x mod 3 == 1]     // map and filter an existing list
[7*x-3 | x in 1..n | true]          // no filtering, just map
[x | x in 1..100 | isPrime(x) ]     // just a filter, no map

Sketches

{7*x-3 | x in 1..n | x mod 3 == 1}
[7*x-3 for x in a where x mod 3 == 1]
[7*x-3 for x in a if x mod 3 == 1]              // Python
[7*x-3 for each (i in a) if (x % 3 == 1)]       // JavaScript 1.7 (uncommon)
for (i <- (1 to 10) if x % 3 == 1) yield 7*x-3  // Scala
[7*x-3, x in 1..n, x mod 3 == 1]
{7*x-3 : x ∈ 1..n ∧ x mod 3 == 1}               // Plain old Math!!
[7*x-3 : x ∈ 1..n, x mod 3 == 1]
[7*_-3 | a | _ mod 3 == 1]                      // implicit variable

Complex input sets

How do we handle the idea of two range variables for the input set? We can write two range expressions or range over a set of tuples:

[2*x+y | x in 1..n, y in 'a'..'z' | x mod 3 == 1 and isVowel(y)]
[(x, y*y) | (x, y) ∈ a × (1..10) | x < y]
[x + y for x in a for y in b]                   // Python

Life without comprehensions

If you don't have comprehensions, or a compact, symbolic list append operator, you end up writing this:

List<Integer> a = new List<Integer>();
for (int x = 1; x <= n; x++) {
    if (x % 3 == 1) {
        a.append(7 * x - 3);
    }
}

Many programmers like that kind of coding. More verbose, but everyone can understand it.
In languages that treat for and if as suffixes to simple statements (like Perl and Ruby), we get compact forms that can be nearly as terse as comprehensions:

a = [];
for x in 1..n
  a << 7 * x - 3 if x mod 3 == 1
end

a = [];
a << 7 * x - 3 if x mod 3 == 1 for x in 1..n

a = [];
a << 7 * x - 3 for x in 1..n where x mod 3 == 1

And, of course, there are map and filter (or select as it is called in Ruby)...

(1..n).filter(x -> x mod 3 == 1).map(x -> 7 * x - 3)
(1..10).select{|x| x % 3 == 1}.map{|x| 7 * x - 3}         // Ruby
(1 to n) filter { _%3==1 } map { 7*_-3 }                  // Scala
map((x -> 7 * x - 3), filter((x -> x mod 3 == 1), (1..n)))

Try the previous examples with your favorite form of lambda expression.

Here's an interesting variation, in Ruby that uses map but not select. Instead we add a if clause to the map, which generates extra nils, which we squeeze out with compact. Not pretty, but it gets that nice little "if" in there. Fun.

(1..n).map{|x| 7 * x - 3 if x % 3 == 0}.compact

Information Hiding

2010-05-20T08:21:00.000-07:00

Information hiding occurs when a language allows some program entities to be invisible to parts of a program. When applied to data within an object or module, we usually use the more specific term encapsulation. Information hiding is a good thing since it can make it harder to write certain types of mailicious code, and make it harder to accidentally ruin parts of a system you didn't mean to touch.

Here are some different language forms for information hiding.

Explicit information hiding

A direct approach is to use a modifier — a keyword or symbol that defines an access level for an entity. Here are various possible forms:

private var x;
protected double x;
private protected var x;
package protected boolean x;
package local var x;
public array of int x;
x: public string;
-int x;
+int x;
-(int) x;
+(int) x;

These modifiers can be used to hide a class within a package, a member within a class, or just about anything within a module. Visibility of fields for example, can be restricted to

The object in which it appears
All objects of the same class in which it appears
The package in which it appears
Its class and all its class's subclasses
Its class and subclasses and the same package

...and it can have no restrictions at all (this generally uses the modifier public).

For languages that favor symbols over words, you could use something like

- for most-restricted (private)
# for semi-restricted or (protected or package-private or package-protected)
+ for least-restricted or (public)

A good case can be made for + and - ... not sure about others.

Implicit encapsulation

The other approach is to implement information hiding simply by the position of entity declarations. This is normally done by nesting, for example, local variables are almost always automatically private to a function (or method or procedure) body:

function f(x) {
    var y;
    ...
}

Parameters are usually private to a function too. We say "usually" because in languages that support named parameter association, calling code can see the parameter name, but it can only be used in the call, so it's plenty safe.

We can have variables that are local to a block, or an iteration statement:

for i in x..y
  # Some code using i
end

x.upto(y) do |i|
  # Some code using i
end

One interesting style is to have a module in which there is one section (usually called the interface) which holds the visible identifiers and another section (the implementation) holding private stuff. A sketch:

module Stacks
    type Stack(int capacity);
    vpid push(Object)
    Object pop()
    int size()
implementation
    Any variable defined here is private
    to this entire implementation section
    . . .
    Here we expand on the interface items
    type Stack(int capacity) is
        int top = 0; 
        Object[] data = Array(Object).new(capacity)
    end
    void push(Object x) 
        raise Overflow if top == capacity;
        data[top++] = x
    end
    Object pop()
        raise Underflow if top == 0;
        data[top--]
    end
    int size()
        top
    end
end

Hybrid Approaches

We can also have modules in which everything is private (hidden) by default — and an explicit export is needed.

module m
    export a;    // but not b
    int a, b;
    . . .
end

In some cases modules are completely closed off and you need an explicit import also.

int x, y;
. . .
module m
    import x;    // but not y
    export a;    // but not b
    int a, b;
    . . .
end

Reserved Words

2010-05-02T13:26:00.000-07:00

Many languages reserve words for their own use, meaning that you, the programmer, cannot use these words for your own purposes. Reserved words are related to keywords which are words used for certain purposes by the language, but which you can also use as you wish.

What is the motivation for a language to have reserved words? Are reserved words a good thing or not? Is there such thing as different degrees of being reserved?

Why reserve words at all?

A language might want to reserve words to limit confusion. The following are legal statements in PL/1, a language which chose not to reserve IF, THEN, ELSE, DO, and similar keywords:

IF THEN THEN THEN = ELSE; ELSE ELSE = THEN;
IF IF THEN ELSE = IF; ELSE THEN = ELSE;
DO WHILE (WHILE = DO); END = WHILE + DO; END;

Would anyone really write code like that? Even if someone felt the need to abuse keywords like that, simply reserving them doesn't completely prevent programmers from writing confusing code in other ways. There's no stopping this:

int subtract(int x, int y) {
    return x + y;
}

Perhaps the real reason languages reserve words is to make things easier for the compiler, or to give the language authors an outlet in which to be funny. For example, the word goto is reserved in Java, even though it isn't used for anything. Of course, nothing prevents a Java programmer from using the word in package names, comments, and string literals. And, as Java is case-sensitive, there's always class Goto ... .

Why not reserve words?

There are a few reasons for not reserving words:

Programmers should have the freedom to choose their own identifiers. It's the compiler's job to figure out when words are used in a particular context.
It's annoying to programmers. There are thousands of Java programs in the world using variable and parameter names called klass, clazz, or c, because class is verboten. (Awesome hackers might even use clаss — haha, the third letter there is a Cyrillic small letter a, not a Latin one!)
As a language evolves, the addition of new reserved words will break old code.

Keywords versus symbols

Here are three alternative syntaxes for a certain statement:

if greater(x, y) and w then begin assign x plus 4 to z; print z; end;
if (x > y && w) {z = x + 4; print z}
(x > y && w ==> z = x + 4, !z)

The obvious things to notice are that too many keywords make code too verbose and readability goes down. Too few keywords gives you cryptic code. Most people understand + and * and /, but really, a ! for "print"? In fact, most symbols aren't obvious at first. Different languages use = differently. And what about these things: <=>. =~. %. ===, ^, *., #, ??, and who knows what else? Anyone use APL?

Symbols are sometimes preferred to keywords for another reason: if keywords are in a language you don't speak, they might be harder to learn than symbols. Programmers that know Spanish but not English may prefer 'y' to 'and' and 'mientras' to 'while'. This is one use for macros, I suppose. In C:

#define mientras while

Symbols as keywords

In ML, identifiers can be alphanumeric or symbolic. So +, <=*=> and ====> are all identifiers. You can define functions with these names. And you can define + to do subtraction. You can even do that in Ruby. Is this okay? The choices seem to be:

Allow symbolic identifiers, but treat some as reserved words so that no programmer can override such basic operators as +.
Allow symbolic identifiers but place no restrictions on them. After all, the reason languages allow "operator overloading" is to allow addition on vectors and other structured objects.
Disallow symbolic identifiers but permit operator overloading (for the fixed set of predefined symbolic operators).
Prohibit both symbolic identifiers and operator overloading.

How reserved is reserved?

When a language specification says that a word is reserved, it generally means one of two things:

The word cannot, under any circumstances, be used as a name for any programmer-defined entity (e.g., a variable, constant, function, type, object property name, method, function parameter, statement label, etc.)
The word can be used to name some programmer-defined entities, but its position in the source code text is restricted to simplify language processing (specifically, tokenization).

The former case may be stricter than necessary because quoting, for one thing, works pretty well for object properties (at least):

point["long"] = 118.3532;

You can't stop people from using reserved words in string literals, after all. Similar to quoting is escaping; for example, you would write something like :

var @while = 2;

The @ is not part of the variable name, it's just an escape character that tells the compiler the variable is really named 'while' but we're going to use it as an identifier and not as the beginning of a while statement.

The second case is interesting. JavaScript, for example, reserves the word "long" and while it allows

point["long"] = 118.3532;

it does not allow

point.long = 118.3532;

The probable reason for this is that it wants its lexical analyzer to be independent of context: for it to simply look that the above text and tokenize it into

    IDENTIFIER ("point")
    DOT
    KEYWORD_LONG
    ASSIGNMENT_OPERATOR
    NUMBER ("118.3532")

But is that really important in practice anymore? Lexical analyzers can easily choose a different meaning for a token if, say, the previous token is a dot! That would allow point.long. Now what about var point = {lat: 40, long: -110}? It would seem we'd have to allow the lexical analyzer to have access to the parser's state, but really, just wait until the semantic analysis phase and allow certain known keyword tokens to be just as good as identifiers when it comes to object property names.

Object Construction

2010-04-29T19:18:00.000-07:00

There are countless ways to build objects. Here are a few.

Classic Constructors

Many class-based object oriented languages have constructors that look suspiciously like methods. Unlike instance methods which operate on existing objects, though, these constructors bring new objects into existence, and generally use a special syntax (such as an operator called new) to distinguish their invocation from that of a method.

new Point("Pantages Theater", 34.1017, -118.3255, 
    "Hollywood", "Los Angeles", "CA", "us", "90028", 
    new LocalDate(1930, 6, 4), Status.ACTIVE)

If there are too many parameters, you can pass in a map (dictionary, associative array, hash), or your language might support named parameter association.

Setters

With setter methods, you first construct an object in which all of its fields are set to default values, then you customize the object by setting just those fields that differ from the defaults.

Point p = new Point();
p.setName("Pantages Theater");
p.setLatitude(34.1017);
p.setLongitude(-118.3255);
p.setNeighborhood("Hollywood");
p.setCity("Los Angeles");
p.setState("CA");
p.setCountry("us");
p.setPostalCode("90028");
p.setEstablished(new LocalDate(1930, 6, 4));
p.setStatus(Status.ACTIVE);

This approach solves the three problems above, but introduces a new problem: the object being constructed can't be immutable. How can you get around this?

Some languages allow fields to be marked "final" so that after the first set, they can't be set again.
In some languages, you can scope the setter methods so that only the parts of the code that are supposed to the construct the objects can see the setters. Since objects should be able to be constructed anywhere, this all but requires the programmer to factor out construction code into its module, which could be unwieldy. But there are ways to do it nicely in general; one way is the builder, described next.

Builders

Builder objects are common too. They have the three advantages of the setter approach (no need to supply non-default values, order doesn't matter, and reader knows what the arguments mean), while also allowing the newly constructed object to be immutable.

Point p = new PointBuilder()
    .name("Pantages Theater");
    .latitude(34.1017);
    .longitude(-118.3255);
    .neighborhood("Hollywood");
    .city("Los Angeles");
    .state("CA");    
    .country("us");
    .postalCode("90028");
    .established(new LocalDate(1930, 6, 4));
    .status(Status.ACTIVE)
    .build();

Factory Methods

A factory method is any regular method (or regular function) that creates and returns a new object. There's no special syntax here; all that is done is to call a constructor, builder, or other device inside the body of the factory method. In Java:

public class Card {
    private Suit suit;
    private Rank rank;
    private Card(Suit s; Rank r) {       // Constructor must be private
        suit = s;                        // to force all clients to use
        rank = r;                        // the factory method below
    }
    public static Card fromRankAndSuit(Rank r, Suit s) {
        return new Card(s, r);
    }
    . . .
}

Literals

In languages without classes, or when classes are not too important (say, when there is duck typing), object literals create new objects. Here's a JavaScript example. Fields are string literals, but we get to omit the quotes if the field name is a simple word that is not a reserved word:

{name: "Pantages Theater", lat: 34.1017, "long": -118.3255, 
    neighborhood: "Hollywood", "state or province": "CA",
    country: "us", "postal code": "90028", city: "Los Angeles", 
    established: new LocalDate("1930-06-04"), status: Status.ACTIVE)}

Object literals are also possible in languages with statically typed, closed classes. In Ada:

type Point is record
    X: Integer;
    Y: Integer;
end record;
. . .
var P: Point := Point'(6, 5);
var Q: Point := Point'(Y => 5, X => 6);

Construction in Prototypal Languages

Modern (ECMAScript 5-based) JavaScript has nice features for creating objects with a given prototype. I came up with this pattern:

/*
 * An immutable circle datatype.  Synopsis:
 *
 * var c = Circle.create(5, {x: 1, y: 4});
 * c.radius      => 5
 * c.center      => {x: 1, y: 4}
 * c.area()      => 25π
 * c.perimeter() => 10π
 */
var Circle = {
    prototype: {
        area: function () {return Math.PI * this.radius * this.radius;},
        perimeter: function () {return 2 * Math.PI * this.radius;}
    },

    create: function (r, c) {
        return Object.create(this.prototype, {
            center: {value: c},
            radius: {value: r}
        });
    }
};

The circles we create with Circle.create are immutable because we use JavaScript's default property descriptors for center and radius: they're not writable, not enumberable, we can't delete them, nor can we change these settings. The same isn't true for the Circle properties (prototype and create), nor for the prototype's own properties...is it worth using property descriptors on these too?

Global Variables

2010-04-17T11:37:00.000-07:00

The concept of a global variable seems pretty simple, especially to beginning programmers writing a single-file program in a block-structured language. But for languages with modules, or languages that support multithreading or multiprocessing, or with scripting languages embedded in other systems that expose "host" objects via script variables, things are not so clear.

Most programmers would warn that globals should be used sparingly, for various reasons:

Because they can be written to from anywhere, it's harder to tell what might be going on just by looking at local portions of code. Code is harder to debug.
Globals make multithreaded programming harder to reason about, and make race conditions more likely to pop up.
Parameters are almost always a more readable and understandable way to "share" data among routines.

How do we live with, or manage, global variables? What constructs can a language have to mitigate these problems?

Languages that Require Global Variables

There's an interesting distinction between languages that allow global variables and those that require them. If a language allows pieces of a program to be brought together from different physical components (such as files), but does not provide a programming language construct like a module or package, then global variables are basically required to share information. Even if you adopt a message-passing type of communication between functions in different files, without a nice module construct, those functions are referenced with global variables.

Modules

Many languages have modules for the express purpose of limiting the number of global variables and hence minimizing the number of name collisions. The simplest example is that of a math module, found in most modern languages. Instead of global variables for E, PI, sin, cos, tan, etc., these values are wrapped in a module construct (usually called Math), which is the only top-level name exposed to the rest of the code. Examples:

# Ruby-like
module Math
  def sin(x) ... end
  def cos(x) ... end
  PI = 3.141592653589793
  E = 2.718281828459045
end

// JavaScript-like
var Math = {
    sin: function (x) {...},
    cos: function (x) {...},
    PI: 3.141592653589793,
    E: 2.718281828459045
};

# Java-like
public class Math {
    private Math() {}
    public static double sin(double x) {...}
    public static double cos(double x) {...}
    public static double PI = 3.141592653589793;
    public static double E = 2.718281828459045;
}

Modules create a namespace for these values which would otherwise be global. If there is a chance of having too many modules (that is, a possibility of having names of modules collide), modules can be grouped into some higher level named collection (Java calls this a package). Whether or not the additional construct is used, one can always use a hierarchical naming convention for global entities, starting, for instance, with a (reversed) DNS name that the programmer "owns."

Multithreading

The dangers of unprotected global variables in multithreading code (e.g., race conditions, lost updates) are so well known we don't need to explain them here. If you are stuck with a global, shared, variable for communication between threads, your language might provide some mechanism for atomic updating. For example:

atomic var balance = 0;

AtomicInteger balance = new AtomicInteger(0);

Internet Explorer Events

In most event-driven systems, an event object is passed to event handlers. If you are using JavaScript, and are writing event handlers to function on any browser except Internet Explorer, this will be the case. You write code like:

document.body.onclick = function (e) {
    alert("Clicked at (" + e.clientX + "," + e.clientY + ")");
}

With Internet Explorer, the event object is not passed to your handler; instead, the most recently fired event is accessed by a global variable (no kidding!) called event. IE gets away with this because client-side JavaScript is single-threaded: all events are queued and are handled sequentially. To make your code work on multiple browsers you can use the following idiom:

document.body.onclick = function (e) {
    if (!e) e = event;
    alert("Clicked at (" + e.clientX + "," + e.clientY + ")");
}

Sigils

Because local variables are usually better than global variables, one interesting design choice is to make locals the default and make globals ugly. Ruby does exactly this: global variables begin with a $ character. That's a sigil — a character used within a variable name to indicate its type, category, or (in this case) scope.

Example:

$x = 10
def f
  x = 3
  puts x
  puts $x
end
def g
  puts $x
end
def h
  puts x
end
f        # writes 3 then 10
g        # writes 10
puts $x  # writes 10
h        # raises a NameError

Another example:

x = 5
def f
  puts x
end
f        # raises a NameError

Hiding in Closures and Anonymous Functions

TODO

Lambdas

2010-03-13T09:45:00.000-08:00

Lambda-expressions, or anonymous functions, are common in many languages, but are either missing or approximated in others.

Simple Expressions

There are many ways to write lambda expressions when the body is a single expression:

λ x . x + 1
{x -> x + 1}                 // Curly braces are common
[x -> x + 1]                 // Not so common, looks like an array
(x -> x + 1)                 // Let the -> operator imply function type
fn x => x + 1                // ML
lambda (x) x + 1
(lambda (x) (+ x 1))
sub {$0 + 1}                 // Perl
{$0 + 1}
{_ + 1}                      // Scala
λ x y . 2 * x + y
{x y -> 2 * x + y}
fn x y => 2 * x + y
fn (x,y) => 2 * x + y
lambda (x, y) 2 * x + y
sub {2 * $0 + $1}
{2 * $0 + $1}

Some of these require a bit more lookahead to parse than others.

Complex Function Bodies

If your language has a sequencing expression, like C's comma expression, where e1, e2, e3 means to evaluate each expression in order and return the value of the last expression encountered, then you can use the sequencing expression in any of the forms above. Operator precedence becomes the only real issue, but parentheses can make it clear exactly what the function is.

{ p, i -> k = p * 0.45359237, m = 0.0254, k / (m * m) }

Let-expressions work well here, too

{ p, i -> let k = p * 0.45359237 and m = 0.0254 in k / (m * m) end }

Lambdas can be arbitrarily complex, with return statements. Some languages allow you to omit the return statement if the expression to return is the last one in the sequence.

TODO

Recursion

To arrange for an anonymous function to be called recursively, it needs to have a name or the language should reserve a special keyword to refer to the currently executing function (watch for scoping rules here).

{[f] n -> n <= 1 ? 1 : f(n - 1) * n}

{n -> n <= 1 ? 1 : CALLEE(n - 1) * n}

Allowing recursion in lambda expressions is far better than requiring the applicative Y combinator.

Closures

Lambdas often refer to variables in outer scopes. Within the lambda expression, these outside variables are called free variables.

function incrementer(int start, int delta) {
    var current = start - delta;
    return function () {return current += delta;}
}

Calling Lambda Expressions

Calling, or applying a function, is normally done with juxtaposition, though again precedence rules will often come into play.

(x -> x + 1)7
(x -> x + 1)(7)
[x -> x + 1](7 * y - 2)
[x -> x + 1] 7 * y - 2         // application or multiplication first? 
fn (x, y) => 2 * x + y (7, y)  // too confusing 
(fn (x, y) => 2 * x + y)(7, y)

Currying

Sketches:

[x -> [y -> 2 * x + y]]
[x -> y -> 2 * x + y]
[x y -> 2 * x + y]
fn x => fn y => 2 * x + y
fn x y => 2 * x + y

Fake Lambdas

TODO

Delegates

TODO

Prototypes

2010-03-03T18:43:00.000-08:00

There are two primary ways that languages allow you to create a set of objects with the same (or similar) structure and behavior. In the classical approach, you define classes and then create objects from constructors associated with the class; in this case, objects are instances of the class. In the prototypal approach, you create a prototype object, and then instantiate new objects based on the prototype. Generally, a language is either classical (e.g., Java, Ruby, C#), or prototypal (e.g, JavaScript, Self, Io, Lua), though some languages let you construct objects both ways (e.g., ActionScript, Falcon).

What might prototype-based instantiations look like?

Simple Differential Inheritance

Perhaps the purest approach to a prototypal mechanism is one in which every object has an implicit link to another object, called its prototype. Whenever we look up a property in an object and it is not present, we go look in its prototype. If the property isn't present in the prototype, we search the prototype's prototype, and so on. (Writing a property is different, however; if the property doesn't exist in the object it is created right there.) Such a language might have an operator called beget which might be used like this:

var protoCircle = {x: 0, y: 0, radius: 1, color: "black"};
var c1 = beget protoCircle;
c1.x = 4;
c1.color = "green";

Here we have c1.x == 4, c1.y == 0, c1.radius == 1, and c1.color == "green".

We can create more circles:

var c2 = beget protoCircle;
c2.x = 4;
c2.radius = 15;
c3 = beget protoCircle;

which gives us this:

If we change any of the fields in the prototype, this change is picked up at run time by all the derived objects that haven't overridden those fields. Ditto for adding new fields to the prototype. To some people, this is convenient and flexible; to others it is ugly and error-prone. It's all those things, of course.

The prototype is also a great place to store function properties:

protoCircle.area = function () {return π * (this.radius ** 2);}

Here we've assumed we have a "this" operator which refers to the object through which the function was called (not necessarily the one along the prototype chain which ultimately contained the function).

Direct Cloning

Another approach is to define a prototype with default field values then clone it to make new objects. Cloning is different from simple differential inheritance because with cloning the new object automatically gets copy of all of the fields in the prototype, even for those that you wanted to keep the default values in the prototype. A clone operator might look like this:

var protoCircle = {x: 0, y: 0, radius: 1, color: "black"};
var c1 = clone protoCircle;
c1.x = 4;
c1.radius = 8;
c1.color = "green";

Cloning means c1 is going to get all four fields:

Cloning via user-defined functions

If a language does not have a built-in clone operator, or you just don't like the look of creating an object and then doing multiple assignments to override defaults, you can do the cloning and field assignments in a function. Default fields can be nicely managed with default parameters, in languages that have them:

function newCircle(x = 0, y = 0, r = 1, c = "black") {
    return {x: x, y: y, radius: r, color: c};
}
var c1 = newCircle(x => 4, r => 8, color => "green");
var c2 = newCircle(color => "yellow", r => 5);
var c3 = newCircle(_, 5, _, "blue");

or with a short circuit disjunction if you have that but no default parameters:

function newCircle(x, y, r, c) {
    return {x: x || 0, y: y || 0, radius: r || 1, color: c || "black"};
}
var c1 = newCircle(x => 4, r => 8, color => "green");

With cloning, we have the problem of where to put the objects' methods. How about global variables?

function circleArea(c) {return π * c.radius ** 2;}
function circlePerimeter(c) {return 2 * π * c.radius;}

That's awful! It pollutes the global namespace. What about putting them inside the objects?

function newCircle(x = 0, y = 0, r = 1, c = "black") {
    return {
        x: x, y: y, radius: r, color: c,
        area: function () {return π * this.radius ** 2;}
        perimeter: function () {return 2 * π * this.radius;}
    };
}

That's no good either — every time a circle object is created, two additional function objects are created for its area and perimeter. If we create thousands of circles we end up with thousands of identical area functions and thousands of perimeter functions sucking up memory. We should make sure we have only one copy of each (without defining global variables to access them). We want that nice little implicit link to the prototype back.

JavaScript Constructors

In JavaScript, prefixing a function call with new calls the function in a context in which the this expression refers to a newly constructed object, and, unless a return statement explicitly returns some other object, returns this newly constructed object.

function Circle(x, y, r, c) {
    this.x = x || 0;
    this.y = y || 0;
    this.radius = r || 1;
    this.color = c || "black";
}

It's also the case that JavaScript is prototypal, and weirdly so: every function is itself an object, with a property called prototype, whose value is the object that will be used as the prototype of every object created with that function. As usual, the prototype is the place where you drop in shared properties, like functions. Here's an example:

function Circle(x, y, r, c) {
    this.x = x || 0;
    this.y = y || 0;
    this.radius = r || 1;
    this.color = c || "black";
}
Circle.prototype.area = function () {return Math.PI * this.radius * this.radius;}
Circle.prototype.perimiter = function () {return Math.PI * 2 * this.radius;}
var c1 = new Circle(3, 8, 4, "green");
var c2 - new Circle(1, 7);
var c3 = new Circle();

Better Prototypal Patterns for JavaScript

It's pretty clear that JavaScript's new operator and the associated semantics for construction were designed to make the langusge look more classical. Douglas Crockford characterizes this beautifully by saying that "JavaScript is conflicted about its prototypal nature." He shows how to make JavaScript implement simple differential inheritance:

if (typeof Object.beget !== 'function') {
    Object.beget = function(o) {
        var F = function () {};
        F.prototype = o;
        return new F();
    };
}

Our running circle example can now be written like this in JavaScript:

var protoCircle = {x: 0, y: 0, radius: 1, color: "black"};
var c1 = Object.beget(protoCircle);
c1.x = 4;
c1.color = "green";

One of the nice things about using beget is we can hide away JavaScript's crazy new operator and never have to use it again. This operator is fairly confusing and error-prone. For example, when you call functions that are supposed to be called with new but aren't, the code ends up creating or overwriting global variables. Object.beget is a nicer approach.

ECMAScript 5 Improvements to JavaScript

Current JavaScript versions (as of mid-2010) are based on the 3rd Edition of ECMAScript, known as ECMAScript 3. The newly standardized ECMAScript 5 adds many properties to the Object object, including create, which is like the beget example above. In new JavaScript:

var protoCircle = {x: 0, y: 0, radius: 1, color: "black"};
var c1 = Object.create(protoCircle, {
    x: {value: 4}, 
    color: {value: "green"}
});

Much better.

Object-Specific Methods

2010-02-27T16:52:00.000-08:00

Sometimes a programmer likes to attach a method to a single object, instead of attaching it to all objects of a class. How can this be done?

We'll make three dogs, Sparky, Spot, and Spike. All dogs bark, but only Spike can bite.

Classless Languages

In JavaScript, just add function properties directly to the object:

var Dog = function (name) {this.name = name;};
Dog.prototype.bark = function () {return "woof";};
var d1 = new Dog("Sparky");
var d2 = new Dog("Spot");
var d3 = new Dog("Spike");
d3.bite = function () {return "GOTCHA";};

It is hard to imagine improving on that.

Singleton Classes

In Ruby, you add the method to the object's singleton class:

class Dog
  def bark(); "woof"; end
end
d1 = Dog.new("Sparky");
d2 = Dog.new("Spot");
d3 = Dog.new("Spike");
class <<d3
  def bite(); "GOTCHA"; end
end

The use of "<<" isn't as readable as it could be, at least for English speakers, so might there be other ways to do this for classes languages?

Object.add_method(d3, "bite", {"GOTCHA"});

d3.add_method("bite", {"GOTCHA"});

def d3.bite(); "GOTCHA"; end;

Regardless of the syntax, with a class-based language, each object will have an internal reference not only to its class, but to its bundle of object-specific methods.

Subclasses

In a language where classes are closed, you "extend" classes by making new subclasses. This seems a little uglier in our contrived example. In Java:

class Dog {
    private String name;
    public Dog(String name) {this.name = name;}
    public String bark() {return "woof";}
}
class BitingDog extends Dog {
    public BitingDog(String name) {super(name);}
    public String bite() {return "GOTCHA";}
}
Dog d1 = new Dog("Sparky");
Dog d2 = new Dog("Spot");
BitingDog d3 = new BitingDog("Spike");

This may be clunky, but it does make a lot more static analysis possible.

Bracketing

2010-02-11T22:02:00.001-08:00

At some point in the history of programming languages, the idea of code as lists of statements with jumps or gotos gave way to the idea of nested (hierarchical) declarations and control structures, as in:

So programs are really trees, but we still need a string representation for storage and transmission. How can we capture trees in text? There are two main approaches. The first is to just give a direct representation of the syntax tree. The other is to be creative with syntax, making something that a human would like to read.

Here are several sketches of representations of the above tree.

S-Expressions

The usual direct representation of trees in strings is that of "S-expressions" — parenthesized forms such as (A B C D) where A is the root and B, C, and D are A's (ordered) children. B, C, and D can be primitive values or trees.

(function f (params x y)
    (block 
        (var b 3)
        (if (< x (* a (+ b 7)))
            (while found
                (block
                    (for i (range 2 5) (print i))
                    (if (== a 2) (break))))
            (= p (cos 5)))
        (return (/ a 5))))

Languages like LISP, Scheme, and Clojure use S-expressions (though with a few little extensions here and there).

In the example above, some of the objects aren't actually evaluated, so the distinction between symbols and values sometimes comes up. Occasionally this is made explicit in the syntax, for example:

(function 'f (params 'x 'y)
    (block 
        (var 'b 3)
        (if (< x (* a (+ b 7)))
            (while found
                (block
                    (for 'i (range 2 5) (print i))
                    (if (== a 2) (break))))
            (= p (cos 5)))
        (return (/ a 5))))

Instead of the quote character for symbols, the colon or ampersand (or other character) could be used.

XML Style

The tree can be given directly in XML, too. This is probably too verbose, not only because we have some markup characters, but because we have to invent elements for leaf nodes.

<function name="f" params="x,y">
  <block>
    <var id="b"><intlit value="3"/></var>
    <if>
      <less>
        <ref id="x"/>
        <times>
          <ref id="a"/>
          <plus><ref id="b"/><intlit value="7"></plus>
        </times>
      </less> 
      <while>
        <ref id="found"/>
        <block>
          <for var="i">
            <range><intlit value="2"/><intlit value="5"/></range>
            <print><ref id="i"/></print>
          </for>
          <if>
            <eq><ref id="a"/><intlit value="2"/></eq>
            <break/>
          </if>
        </block>
      </while>
      <assign>
        <ref id="p"/>
        <cos><intlit value="5"/></cos>
      </assign>
    </if>
    <return>5</return>
  </block>
</function>

Reader exercise: Could this have been done without the ref and intlit elements?

JSON Style

Another form of direct tree representation is an encoding in JSON. This gets pretty verbose as well because in JSON the properties are unordered, so naming them seems to be the most reasonable approach. And again, we have to name everything:

{type: "function", name: "f", params: ["x", "y"], body: {
    type: "block", statements: [
        {type: "var", id: "b", init: 3},
        {type: "if", 
            condition: {type: "<", left: "x", right: {
                type: "*", 
                left: "a", 
                right: {type: "+", left: "b", right: 7}
            }}, 
            truePart: {type: "while",
                condition: "found", body: {type: "block", statements: [
                    {type: "for", var: "i", 
                        range: {type: "range", low: 2, high: 5}, 
                        body: {type: "print", args: ["i"]}
                    },
                    {type: "if",
                        condition: {type: "==", left: "a", right: 2},
                        truePart: {type: "break"}
                    }
                ]}
            }, 
            falsePart: {type: "assign", 
                target: "p", 
                source: {type: "cos", args: [5]}
            }
        },
        {type: "return", value: 5}
    ]
}}

Indentation

Okay, let's move away from direct representations and use some real syntax. By real syntax we mean cleverly arranging symbols in a pleasing and readable way but still having it be possible to unambiguously recapture the tree structure.

function f (x, y)
    var b = 3
    if x < a * (b + 7)
        while found
            for i in 2..5
                print i
            if a == 2
                break
    else
        p = cos(5)
    return a / 5

Curly Braces

The use of curly braces to define subtrees is, for some reason, very popular.

function f (x) {
    var b = 3;
    if (x < a * (b + 7)) {
        while (found) {
            for (i in 2..5) {
                print i;
            }
            if (a == 2) {
                break;
            }
        }
    } else {
        p = cos(5);
    }
    return a / 5;
}

Terminating Keywords

Rather than marking the beginning and end of a block, as is done with the curly brace style and its equivalents, we can get away with just marking the end. This can be done simply with the word end:

function f (x, y)
    var b = 3
    if x < a * (b + 7)
        while found
            for i in 2..5
                print i
            end
            if a == 2
                break
            end
        end
    else
        p = cos(5)
    end
    return a / 5
end

or by spelling the initial word backwards (yes, many languages do this):

function f (x, y)
    var b = 3
    if x < a * (b + 7)
        while found
            for i in 2..5
                print i
            rof
            if a == 2
                break
            fi
        elihw
    else
        p = cos(5)
    fi
    return a / 5
noitcnuf

Postfix Style

Here's a sketch using the postfix style. Here it's definitely important, I think, to distinguish symbols from variables that need to be evaluated. I used square brackets to bundle code, as should be obvious:

[
    'x' param 'y' param
    3 'b' var
    x a b 7 + * <
    [
        found
        [
            'i' 2 5 [i print] for
            2 a == [break] [] if
        ]
    while]
    [5 cos p assign]
    if
    a 5 / return
] 'f' function

Functional Style, with 'let' and Infix Operators

Here's a style that borrows from Standard ML. Block scope is introduced with the let...in...end expression, and expression sequences use semicolons as separators. Expression sequences are allowed in the body of a let, or wherever a regular expression can go, provided the sequence is enclosed in parentheses.

function f (x, y) =
    let 
        b = 3 
    in
        if x < a * (b + 7) then
            while found do (
                for i in 2..5
                    print i
                ;
                if a == 2 then
                    break
                else
                    nil
            )
        else
            p = cos(5)
        ;
        a / 5
    end

Else ifs

2010-02-08T17:50:00.001-08:00

You may have heard of the so-called dangling else problem. The text

if e1 then if e2 then s1 else s2

can reasonably mean either of two possible things:

When mixed with other statements without any obvious end, things can get crazy:

if e1 then while e2 do if e3 then s1 else s2

The dangling-else problem arises in languages with ambiguous grammars, such as the Pascal-like:

Stmt      = Assignment | IfStmt | WhileStmt | Block | ...
IfStmt    = "if" Exp "then" Stmt ("else" Stmt)?
WhileStmt = "while" Exp "do" Stmt
Block     = "begin" Stmt* "end"

or the C-like:

Stmt      = Assignment | IfStmt | WhileStmt | Block | ...
IfStmt    = "if" "(" Exp ")" Stmt ("else" Stmt)?
WhileStmt = "while" "(" Exp ")" Stmt
Block     = "{" Stmt* "}"

There are many ways to deal with this problem, but most lead to rather unsatisfactory ways to handle the general multi-way conditional. Is there a design that avoids dangling else problems and forces programmers to write nice multiway conditional statements?

Resolve with a semantic rule

In Pascal and C, the syntax is ambiguous, but a semantic rule says that each else must match the closest if, regardless of indentation. This can be terribly confusing to beginners; the code below on the left doesn't mean what novices might think:

(* Without blocks, 'else'
tied to second 'if' *)
if cond1 then
    if cond2 then
        stmt1
else (* Sorry! Goes w/ cond2! *)
    stmt2

(* MUST use block to tie
'else' to first 'if' *)
if cond1 then begin
    if cond2 then
        stmt1
end else (* Now it goes w/ cond1. *)
    stmt2

One objection to this approach is that purists will probably hate how this solution relies on a semantic rule to handle something so inherently structural. A second objection is that multiway conditionals don’t really exist—the syntax can only describe deeply nested individual conditionals. So programmers end up playing games with their indentation to simulate multiway branching:

if cond1 then
    stmt1
else if cond2 then
    stmt2
else if cond3 then
    stmt3
else if cond4 then
    stmt4
else
    stmt5

if cond1 then begin
    stmts1
end else if cond2 then begin
    stmts2
end else if cond3 then begin
    stmts3
end else if cond4 then begin
    stmts4
end else begin
    stmts5
end

This is ugly—but it could be a lot worse: programmers often mix up their conditional arms by sometimes using blocks and sometimes not, and get the ugly code they deserve. The worst part of the “semantic rule solution” dangling else syntax is that there seem to be hundreds of different ways to format multiway if statements. Take a look at some old Pascal textbooks if you want to be grossed out.

Complicate the grammar to remove the ambiguity

It is possible, with a lot of work, to force all elses to be paired with the nearest then. Call an open statement one with a sub-statement that dangles at the end, and a closed statement one that is not open. Then:

Stmt           = OpenStmt | ClosedStmt
OpenStmt       = IfThenStmt | IfThenElseStmt | WhileStmt | ForStmt
ClosedStmt     = Assignment | Call | BreakStmt | ReturnStmt
NoShortIfStmt  = ClosedStmt | NoShortIfIfThenElse
               | NoShortIfWhile | NoShortIfFor
IfThenStmt     = "if" Exp "then" Stmt
IfThenElseStmt = "if" Exp "then" NoShortIfStmt "else" Stmt
NoShortIfIfThenElse = 
    "if" EXP "then" NoShortIfStmt "else" NoShortIfStmt
NoShortIfWhile = "while" Exp "do" NoShortIfStmt
NoShortIfFor   = "for" Var "in" Range "do" NoShortIfStmt
etc.

This grammar is no longer ambiguous, and every else is automatically connected to the nearest if. But it does nothing to prevent the confusion arising from misindented code we saw above.

Require the else part

If there always has to be an else part, there is never any ambiguity, but you’ll have funny looking code when your else part is missing: you'll need some empty statement or a special null keyword.

if cond1 then
    if cond2 then
        stmt1
    else
        null
else
    stmt2

if cond1 then
    if cond2 then
        stmt1
    else
        stmt2
else
    null

Not pretty. It works, though, even if the while statement (and other compound statements) aren’t fully bracketed, for example:

if e1 then while e2 do if e3 then s1 else null else s2

and

if e1 then while e2 do if e3 then s1 else s2 else null

Require bracketing

Ultimately, languages that have compound statements where the bodies are “trailing statements” leave too many formatting choices and look ugly and unbalanced to many people. Defining a language’s syntax to require bracketed compound statements is a Good Thing for that reason, and as a bonus it removes the dangling else problem completely. The idea is that a block should not be a kind of statement, and that all compound statements use blocks for bodies, never simple statements! This grammar fragment:

Stmt      = Assignment | IfStmt | WhileStmt | ...
IfStmt    = "if" "(" Exp ")" Block ("else" Block)?
WhileStmt = "while" "(" Exp ")" Block
Block     = "{" Stmt* "}"

yields code like:

if (cond1) {
    if (cond2) {
        stmts1
    } else {
        stmts2
    }
}

if (cond1) {
    if (cond2) {
        stmts1
    }
} else {
    stmts2
}

Bracketing can also be done with just a terminating end instead of curly braces or begin-end pairs:

Stmt      = Assignment | IfStmt | WhileStmt | ...
IfStmt    = "if" Exp "then" Stmt+ ("else" STMT+)? end
WhileStmt = "while" Exp "do" Stmt+ "end"

yielding rather clean code like this:

if cond1 then
    if cond2 then
        stmts1
    else
        stmts2
    end
end

if cond1 then
    if cond2 then
        stmts1
    end
else
    stmts2
end

But, wait—if we really require bracketing, won’t that make multiway conditionals ugly? Either you start indenting too much or you get a bunch of }s (or ends) at the end. Like this, right?

if cond1 {
    stmts1
} else {
    if cond2 {
        stmts2
    } else {
        if cond3 {
            stmts3
        } else {
            if cond4 {
                stmts4
            } else {
                stmts5
            }
        }
    }
}

if cond1 {
    stmts1
} else { if cond2 {
    stmts2
} else { if cond3 {
    stmts3
} else { if cond4 {
    stmts4
} else {
    stmts5
}}}}

Make indentation matter

In Python, there is no dangling else problem since the indentation makes things clear:

if cond1:
    if cond2:
        stmts1
else
   stmts2

if cond1:
    if cond2:
        stmts1
    else
        stmts2

But, wouldn't required indentation lead to funny looking code in multiway conditionals?

if cond1:
    stmts1
else:
    if cond2:
        stmts2
    else
        if cond3:
            stmts3
        else
            if cond4:
                stmts4
            else
                stmts5

While the code above is legal Python, real Python programmers use the next solution.

Introduce a new keyword to simplify required bracketing or indentation

In most languages where bracketing (or indentation) is required, the multiway conditional is described syntactically as a single if-statement, usually with the help of a special keyword (called elsif in Ada, Ruby, and Perl; elif in bash and Python; and elseif in PHP. We have both a curly-brace style:

IfStmt =  "if" "(" Exp ")" Block
          ("elsif" "(" Exp ")" Block)*
          ("else" Block)?
Block  =  "{" STMT* "}"

and terminating-end style:

IfStmt = "if" EXP "then" Stmt +
          ("elsif"  Exp "then" Stmt+)*
          ("else" Stmt+)?
          "end"

Code looks like this:

# Ruby
if cond1
  stmts1
elsif cond2
  stmts2
elsif cond3
  stmts3
elsif cond4
  stmts4
else
  stmts5
end

# Python
if cond1:
    stmts1
elif cond2:
    stmts2
elif cond3:
    stmts3
elif cond4:
    stmts4
else
    stmts5

// PHP
if (cond1) {
    stmts1
} elseif (cond2) {
    stmts2
} elseif (cond3) {
    stmts3
} elseif (cond4) {
    stmts4
} else {
    stmts5
}

Lisp’s COND

Because Lisp is basically written in abstract syntax trees, you won’t ever have a dangling else, and the multiway conditional is already handled by COND:

(COND
  (condition1 block1)
  (condition2 block2)
  (condition3 block3)
  (condition4 block4)
  (T block5))

Interestingly, Erlang uses the same idea, though with a terminating-end syntax. There are no explicit elses; Erlang’s if is inherently multiway:

if
  condition1 -> body1;
  condition2 -> body2;
  condition3 -> body3;
  condition4 -> body4;
end.

Require bracketing but accept lookahead

There is a way to require bracketing without resorting to special words like elsif or elif and without ending up with a whole mess of terminators at the end. The solution is amazingly simple:

IfStmt = "if" "(" Exp ")" Block
         ("else" "if" "(" Exp ")" Block)*
         ('else' Block)?
Block  = "{" Stmt* '}'

I haven't seen this before. Why is this not popular? The only thing I can think of is that top-down parsers will need a two-token lookahead when encountering an else. Why should this be a big deal? Just peek ahead to see if the next token is an if (or a {).
How can we do this for a terminating-end syntax? This doesn't work very well:

IfStmt = "if" Exp "then" Stmt+
         ("else" "if" Exp "then" Stmt+)*
         ("else" Stmt+)?
         "end"

because the amount of required lookahead is infinite. What about rejecting 'if' statements in the final else-part?

Stmt   = IfStmt | NonIfStmt
IfStmt = "if" Exp "then" Stmt+
         ("else" "if" Exp "then" Stmt+)*
         ("else" NonIfStmt Stmt*)?
         "end"

There we go! As long as NonIfStmt cannot be empty and cannot start with if, we can parse top-down with a lookahead of 2.

Novel syntactic forms

If the objection to elsif and related words is just that they are made up, we could give the if-statement a make over, using more reasonable words, or symbols, even::

when cond1 do
    stmts1
or when cond2 do
    stmts2
or when cond3 do
    stmts3
or when cond4 do
    stmts
or else do
    stmts
end

try cond1 do
    stmts1
or cond2 do
    stmts2
or cond3 do
    stmts3
or cond4 do
    stmts
otherwise
    stmts
end

try [ cond1 ] =>
    stmts1
[ cond2 ] =>
    stmts2
[ cond3 ] =>
    stmts3
[ cond4 ] =>
    stmts
[] =>
    stmts
end