Programming Language Design Sketchbook

2010-06-28

REST

REST, or representational state transfer, is a set of architectural principles used in the design of the world wide web, that have proven the key to the web's efficiency and scalability. What are these principles, and how can they be used in the design of programming languages?

Architectural Constraints

REST was described by Roy Fielding in his dissertation. He called out the following set of architectural constraints:

Client-Server — separates UI from data storage
Stateless Server — improves reliability and scalability
Client Cache — reduces some network traffic
Uniform Interface — decouples implementations from the services they provide
Layered System — means that each component only be concerned with those just below or just above it
Code-on-Demand — allows client functionality to be extended by downloading applets or scripts

Principles of REST

A RESTful system is characterized by

Addressable Resources
- Every resource has a (unique) identifier (e.g., URI)
- Example: http://www.yourcompany.com/products/2443
- NOT http://www.yourcompany.com/getProduct?id=2443
- Resources are separate from their representations
Representation-Orientation
- Requests for resources return representations
- The representation contains the state of the resource
- The client can update or delete resources through their representations
Uniform, Constrained Interface
- Small, fixed number of verbs
- Example: In HTTP, GET, PUT, POST, DELETE, HEAD, OPTIONS, TRACE only!
- An example of a non-constrained interface is an RPC-based protocol (e.g., SOAP, CORBA)
Self-Descriptive Messages
- Messages include all the information necessary to be processed
- Example: Internet Media Types
Stateless Server
- With no shared context on the server, client requests are independent
- Servers can be simpler, "more" multithreaded, easier to monitor, reliable, scalable, etc.
Cacheability
- Responses indicate whether the client can cache the data or not
- Obviously can cut down on network traffic if the client knows it doesn't have to ask for the same data again
HATEOAS
- Hypermedia as the engine of application state
- Representations should include links to related resources

Can we apply any of these principles to programming language design?

System-wide unique identifiers

The idea of identifiers having wide scope, whether unique across a system or unique across the entire world (such as URIs) exists in many forms: UUIDs, MAC addresses, public IP addresses, etc. In programming languages, Java package names are encouraged to be globally unique by using a (reversed) DNS name over which you have control.

Resource/Representation Separation

A limited, fixed, set of verbs

Self-descriptive content

Statelessness

Cachability

Linkage of Representations

2010-06-22

Eval

A common feature in many scripting languages is an eval function. Pass it a string representing source code, and the function will compile it and then interpret it. This function is called eval if the source code string represents an expression that produces a value, and might be called exec if it does not.

Eval is controversial and much-maligned because it is (1) so often abused, (2) so often used when much better alternatives exist, (3) is slow, and (4) can lead to disaster if misused. It is not rare to hear the claim "eval is evil." So the questions are: should a language allow eval and if so, how should this feature be designed?

Problems with eval

Things to watch out for:

Eval is slow because a compiler has to be launched to lex and parse the code string, prior to evaluating it.
Eval is lame because programmers may sometimes use eval without thinking. While it would be rare for anyone to blatantly pass a fixed string to eval, such as
```
eval("x = 3;");
```
one can usually get away with creating an anonymous function.
Eval is a security hole. Since the only reason for using eval is to run code that is supplied at run time, it's possible that this code may come from an untrusted or malicious source. Allowing just anyone to run code on your own machine is crazy.

When is eval not evil?

If you are going to use eval, the string must be completely sanitized. You need to check it for infinite loops, assignments, or calls with side-effects that might destroy the integrity of your application. For example, here is a JavaScript application that accepts an arithmetic expression from a user and evaluates it, first checking to make sure the input consists only of digits, parentheses, and the four basic arithmetic operators:

var s = prompt("Enter a numeric formula");
if (/[^\d()+*/-]/.test(s)) {
    alert("I don't trust that input");
} else {
    alert(eval(s));
}

Eval and Language Design

Either a language will allow evaluation of code strings or it will not. If it does, we can provide that functionality through a function or an operator. Because it is dangerous, it is definitely a candidate construct for being required to appear inside a "warning" construct, or similarly, disallowed from strict modes. Examples:

UNSAFE module Calcuator {
    ....
    // use eval here
    ....
}

use module UNSAFE;
application Calcuator {
    ....
    // use UNSAFE.eval here
    ....
}

Eval as a function

Generally, we see eval as a global function, or a member of a module entitled something like Kernel, System, or perhaps better, UNSAFE.

Eval as an operator

Another way to call attention to the use of eval is to simply make it a unary operator on strings. Perhaps:

var s = prompt("Enter an expression");
alert( `` s );

Alternatively, we might see the string to be evaled enclosed in angle or other brackets.

Disallowing Eval

Clearly we could imagine a language without an eval function or operator. C is one such language.

Forced Sanitization

An interesting new idea is for the eval function to take a second parameter which is (1) a block or anonymous function that performs sanitization, or (2) a regex which will be applied to the string so that the string will only be applied if the regex pattern matches. Example:

var s = prompt("Enter a numeric formula");
alert(eval(s, /[^\d()+*/-]/));

Throwing an exception on not matching would likely be the best course of action here.

2010-06-06

Regular Expressions

A regular expression (or regex, regexp, r.e.) is a pattern that matches a string or a portion of a string. They are used for validation, search, and find/replace. You'll find them in Java, JavaScript, Ruby, Python, Perl, and dozens of other popular languages. All languages agree on a common notation core, e.g. square brackets for character classes, ? for zero-or-one, * for zero or more, + for one-or-more, parentheses for grouping and capturing, and so on.

There's room for some notational innovations here, though. We'll cover a little of what's already in use and sketch some new ideas, too.

Examples from the Common Notation

Virtually all languages will interpret these the same way:

hello
gray|grey
gr(a|e)y
gr[ae]y
colou?r
go*gle
go+gle
g(oo)+gle
z{3}
z{3,6}
z{3,}
[Bb]rainf\*\*k
\d{5}(-\d{4})?
1\d{10}
[2-9]|[12]\d|3[0-6]
Hello\nworld
b..b
\d+(\.\d\d)?
^dog
dog$
^dog$
sh[^i]t
\d+(\.\d+([Ee][+-]?\d+)?)?
https?://(?:www\.)citysearch\.com/profile/[^/]+/(\d+)/?

Metacharacters and Escapes

Most characters stand for themsleves in a pattern. The ones that don't are called metacharacters. Most languages have 14 metacharacters (some more, some less). The usual ones are:

    |    ?    *    +    .    ^    $
    (    )    [    ]    {    }    \

If you want to use a metacharacter as a regular character, you have to escape it, e.g. \+, \[, \\, etc. The backslash also introduces dozens of other simplified expressions:

...
... TODO ...
...

Defining Patterns

How do you write a regex, such as \d+(\.\d+)? in program text? How about in a string literal?

'\d+(\.\d+)?'

Maybe.... The thing is most languages use \ to escape within a string literal. Perl and Ruby don't make the backslash special if the string literal has single quotes, but they do for double quotes, like most languages. In that case, you have to write:

"\\d+(\\.\\d+)?"

And if you need to write a regex that has to match a backslash character, the regex is \\, which looks like this in a string literal:

"\\\\"

But a regex is not a string: it has to be compiled into a little program that matches:

var r = re.compile("\\d+(\\.\\d+)?")
r = "\\d+(\\.\\d+)?".toregex
Pattern p = Pattern.compile("\\d+(\\.\\d+)?");

But since regexes are so important, and that backslash issue gets so annoying, we'd like a language to have a special syntax for describing regexes—something that automatically compiles once, and frees us from doubly escaping backslashes. Common notations include Ruby's %r
and slash delimiters:

%r{\d+(\.\d+)?}
/\d+(\.\d+)?/

The problem with the slash approach is that you have to escape any / characters you want in the regex itself. The %r approach has so such trouble; the language's parser can figure out which is the terminating } easily.

Matchers

To use a regex you match it against some text. This allows you to

Validate that the text conforms to a desired pattern
Find (and extract) the parts of the text that do conform
Replace matching portions of the text with something else

So these matches are stateful ... TODO ... Once you have a (compiled, immutable) pattern, you use it by creating a (stateful) matcher.

Character Classes

Quantifiers

Groups and Capturing

Backreferences

Boundaries

Lookaround

Modifiers

2010-05-20

List Comprehensions

A list comprehension is a compact syntactic form for a list (or array) that is built from applying an operation to certain elements of an existing list (or array). For example:

[7*x-3 | x in 1..n | x mod 3 == 1]

means the list of all values made by computing 7 * x - 3, for each x in 1 through n where x mod 3 is 1. If n is 10, for example, the generated list is [4, 25, 46, 67]. The three parts of the comprehension are called the output (expression), the input (range) and the filter (predicate).

Common uses of comprehensions

[7*x-3 | x in 1..n | x mod 3 == 1]  // operate on (map) and filter a range 
[7*x-3 | x in a | x mod 3 == 1]     // map and filter an existing list
[7*x-3 | x in 1..n | true]          // no filtering, just map
[x | x in 1..100 | isPrime(x) ]     // just a filter, no map

Sketches

{7*x-3 | x in 1..n | x mod 3 == 1}
[7*x-3 for x in a where x mod 3 == 1]
[7*x-3 for x in a if x mod 3 == 1]              // Python
[7*x-3 for each (i in a) if (x % 3 == 1)]       // JavaScript 1.7 (uncommon)
for (i <- (1 to 10) if x % 3 == 1) yield 7*x-3  // Scala
[7*x-3, x in 1..n, x mod 3 == 1]
{7*x-3 : x ∈ 1..n ∧ x mod 3 == 1}               // Plain old Math!!
[7*x-3 : x ∈ 1..n, x mod 3 == 1]
[7*_-3 | a | _ mod 3 == 1]                      // implicit variable

Complex input sets

How do we handle the idea of two range variables for the input set? We can write two range expressions or range over a set of tuples:

[2*x+y | x in 1..n, y in 'a'..'z' | x mod 3 == 1 and isVowel(y)]
[(x, y*y) | (x, y) ∈ a × (1..10) | x < y]
[x + y for x in a for y in b]                   // Python

Life without comprehensions

If you don't have comprehensions, or a compact, symbolic list append operator, you end up writing this:

List<Integer> a = new List<Integer>();
for (int x = 1; x <= n; x++) {
    if (x % 3 == 1) {
        a.append(7 * x - 3);
    }
}

Many programmers like that kind of coding. More verbose, but everyone can understand it.
In languages that treat for and if as suffixes to simple statements (like Perl and Ruby), we get compact forms that can be nearly as terse as comprehensions:

a = [];
for x in 1..n
  a << 7 * x - 3 if x mod 3 == 1
end

a = [];
a << 7 * x - 3 if x mod 3 == 1 for x in 1..n

a = [];
a << 7 * x - 3 for x in 1..n where x mod 3 == 1

And, of course, there are map and filter (or select as it is called in Ruby)...

(1..n).filter(x -> x mod 3 == 1).map(x -> 7 * x - 3)
(1..10).select{|x| x % 3 == 1}.map{|x| 7 * x - 3}         // Ruby
(1 to n) filter { _%3==1 } map { 7*_-3 }                  // Scala
map((x -> 7 * x - 3), filter((x -> x mod 3 == 1), (1..n)))

Try the previous examples with your favorite form of lambda expression.

Here's an interesting variation, in Ruby that uses map but not select. Instead we add a if clause to the map, which generates extra nils, which we squeeze out with compact. Not pretty, but it gets that nice little "if" in there. Fun.

(1..n).map{|x| 7 * x - 3 if x % 3 == 0}.compact

Information Hiding

Information hiding occurs when a language allows some program entities to be invisible to parts of a program. When applied to data within an object or module, we usually use the more specific term encapsulation. Information hiding is a good thing since it can make it harder to write certain types of mailicious code, and make it harder to accidentally ruin parts of a system you didn't mean to touch.

Here are some different language forms for information hiding.

Explicit information hiding

A direct approach is to use a modifier — a keyword or symbol that defines an access level for an entity. Here are various possible forms:

private var x;
protected double x;
private protected var x;
package protected boolean x;
package local var x;
public array of int x;
x: public string;
-int x;
+int x;
-(int) x;
+(int) x;

These modifiers can be used to hide a class within a package, a member within a class, or just about anything within a module. Visibility of fields for example, can be restricted to

The object in which it appears
All objects of the same class in which it appears
The package in which it appears
Its class and all its class's subclasses
Its class and subclasses and the same package

...and it can have no restrictions at all (this generally uses the modifier public).

For languages that favor symbols over words, you could use something like

- for most-restricted (private)
# for semi-restricted or (protected or package-private or package-protected)
+ for least-restricted or (public)

A good case can be made for + and - ... not sure about others.

Implicit encapsulation

The other approach is to implement information hiding simply by the position of entity declarations. This is normally done by nesting, for example, local variables are almost always automatically private to a function (or method or procedure) body:

function f(x) {
    var y;
    ...
}

Parameters are usually private to a function too. We say "usually" because in languages that support named parameter association, calling code can see the parameter name, but it can only be used in the call, so it's plenty safe.

We can have variables that are local to a block, or an iteration statement:

for i in x..y
  # Some code using i
end

x.upto(y) do |i|
  # Some code using i
end

One interesting style is to have a module in which there is one section (usually called the interface) which holds the visible identifiers and another section (the implementation) holding private stuff. A sketch:

module Stacks
    type Stack(int capacity);
    vpid push(Object)
    Object pop()
    int size()
implementation
    Any variable defined here is private
    to this entire implementation section
    . . .
    Here we expand on the interface items
    type Stack(int capacity) is
        int top = 0; 
        Object[] data = Array(Object).new(capacity)
    end
    void push(Object x) 
        raise Overflow if top == capacity;
        data[top++] = x
    end
    Object pop()
        raise Underflow if top == 0;
        data[top--]
    end
    int size()
        top
    end
end

Hybrid Approaches

We can also have modules in which everything is private (hidden) by default — and an explicit export is needed.

module m
    export a;    // but not b
    int a, b;
    . . .
end

In some cases modules are completely closed off and you need an explicit import also.

int x, y;
. . .
module m
    import x;    // but not y
    export a;    // but not b
    int a, b;
    . . .
end

2010-05-02

Reserved Words

Many languages reserve words for their own use, meaning that you, the programmer, cannot use these words for your own purposes. Reserved words are related to keywords which are words used for certain purposes by the language, but which you can also use as you wish.

What is the motivation for a language to have reserved words? Are reserved words a good thing or not? Is there such thing as different degrees of being reserved?

Why reserve words at all?

A language might want to reserve words to limit confusion. The following are legal statements in PL/1, a language which chose not to reserve IF, THEN, ELSE, DO, and similar keywords:

IF THEN THEN THEN = ELSE; ELSE ELSE = THEN;
IF IF THEN ELSE = IF; ELSE THEN = ELSE;
DO WHILE (WHILE = DO); END = WHILE + DO; END;

Would anyone really write code like that? Even if someone felt the need to abuse keywords like that, simply reserving them doesn't completely prevent programmers from writing confusing code in other ways. There's no stopping this:

int subtract(int x, int y) {
    return x + y;
}

Perhaps the real reason languages reserve words is to make things easier for the compiler, or to give the language authors an outlet in which to be funny. For example, the word goto is reserved in Java, even though it isn't used for anything. Of course, nothing prevents a Java programmer from using the word in package names, comments, and string literals. And, as Java is case-sensitive, there's always class Goto ... .

Why not reserve words?

There are a few reasons for not reserving words:

Programmers should have the freedom to choose their own identifiers. It's the compiler's job to figure out when words are used in a particular context.
It's annoying to programmers. There are thousands of Java programs in the world using variable and parameter names called klass, clazz, or c, because class is verboten. (Awesome hackers might even use clаss — haha, the third letter there is a Cyrillic small letter a, not a Latin one!)
As a language evolves, the addition of new reserved words will break old code.

Keywords versus symbols

Here are three alternative syntaxes for a certain statement:

if greater(x, y) and w then begin assign x plus 4 to z; print z; end;
if (x > y && w) {z = x + 4; print z}
(x > y && w ==> z = x + 4, !z)

The obvious things to notice are that too many keywords make code too verbose and readability goes down. Too few keywords gives you cryptic code. Most people understand + and * and /, but really, a ! for "print"? In fact, most symbols aren't obvious at first. Different languages use = differently. And what about these things: <=>. =~. %. ===, ^, *., #, ??, and who knows what else? Anyone use APL?

Symbols are sometimes preferred to keywords for another reason: if keywords are in a language you don't speak, they might be harder to learn than symbols. Programmers that know Spanish but not English may prefer 'y' to 'and' and 'mientras' to 'while'. This is one use for macros, I suppose. In C:

#define mientras while

Symbols as keywords

In ML, identifiers can be alphanumeric or symbolic. So +, <=*=> and ====> are all identifiers. You can define functions with these names. And you can define + to do subtraction. You can even do that in Ruby. Is this okay? The choices seem to be:

Allow symbolic identifiers, but treat some as reserved words so that no programmer can override such basic operators as +.
Allow symbolic identifiers but place no restrictions on them. After all, the reason languages allow "operator overloading" is to allow addition on vectors and other structured objects.
Disallow symbolic identifiers but permit operator overloading (for the fixed set of predefined symbolic operators).
Prohibit both symbolic identifiers and operator overloading.

How reserved is reserved?

When a language specification says that a word is reserved, it generally means one of two things:

The word cannot, under any circumstances, be used as a name for any programmer-defined entity (e.g., a variable, constant, function, type, object property name, method, function parameter, statement label, etc.)
The word can be used to name some programmer-defined entities, but its position in the source code text is restricted to simplify language processing (specifically, tokenization).

The former case may be stricter than necessary because quoting, for one thing, works pretty well for object properties (at least):

point["long"] = 118.3532;

You can't stop people from using reserved words in string literals, after all. Similar to quoting is escaping; for example, you would write something like :

var @while = 2;

The @ is not part of the variable name, it's just an escape character that tells the compiler the variable is really named 'while' but we're going to use it as an identifier and not as the beginning of a while statement.

The second case is interesting. JavaScript, for example, reserves the word "long" and while it allows

point["long"] = 118.3532;

it does not allow

point.long = 118.3532;

The probable reason for this is that it wants its lexical analyzer to be independent of context: for it to simply look that the above text and tokenize it into

    IDENTIFIER ("point")
    DOT
    KEYWORD_LONG
    ASSIGNMENT_OPERATOR
    NUMBER ("118.3532")

But is that really important in practice anymore? Lexical analyzers can easily choose a different meaning for a token if, say, the previous token is a dot! That would allow point.long. Now what about var point = {lat: 40, long: -110}? It would seem we'd have to allow the lexical analyzer to have access to the parser's state, but really, just wait until the semantic analysis phase and allow certain known keyword tokens to be just as good as identifiers when it comes to object property names.

2010-04-29

Object Construction

There are countless ways to build objects. Here are a few.

Classic Constructors

Many class-based object oriented languages have constructors that look suspiciously like methods. Unlike instance methods which operate on existing objects, though, these constructors bring new objects into existence, and generally use a special syntax (such as an operator called new) to distinguish their invocation from that of a method.

new Point("Pantages Theater", 34.1017, -118.3255, 
    "Hollywood", "Los Angeles", "CA", "us", "90028", 
    new LocalDate(1930, 6, 4), Status.ACTIVE)

If there are too many parameters, you can pass in a map (dictionary, associative array, hash), or your language might support named parameter association.

Setters

With setter methods, you first construct an object in which all of its fields are set to default values, then you customize the object by setting just those fields that differ from the defaults.

Point p = new Point();
p.setName("Pantages Theater");
p.setLatitude(34.1017);
p.setLongitude(-118.3255);
p.setNeighborhood("Hollywood");
p.setCity("Los Angeles");
p.setState("CA");
p.setCountry("us");
p.setPostalCode("90028");
p.setEstablished(new LocalDate(1930, 6, 4));
p.setStatus(Status.ACTIVE);

This approach solves the three problems above, but introduces a new problem: the object being constructed can't be immutable. How can you get around this?

Some languages allow fields to be marked "final" so that after the first set, they can't be set again.
In some languages, you can scope the setter methods so that only the parts of the code that are supposed to the construct the objects can see the setters. Since objects should be able to be constructed anywhere, this all but requires the programmer to factor out construction code into its module, which could be unwieldy. But there are ways to do it nicely in general; one way is the builder, described next.

Builders

Builder objects are common too. They have the three advantages of the setter approach (no need to supply non-default values, order doesn't matter, and reader knows what the arguments mean), while also allowing the newly constructed object to be immutable.

Point p = new PointBuilder()
    .name("Pantages Theater");
    .latitude(34.1017);
    .longitude(-118.3255);
    .neighborhood("Hollywood");
    .city("Los Angeles");
    .state("CA");    
    .country("us");
    .postalCode("90028");
    .established(new LocalDate(1930, 6, 4));
    .status(Status.ACTIVE)
    .build();

Factory Methods

A factory method is any regular method (or regular function) that creates and returns a new object. There's no special syntax here; all that is done is to call a constructor, builder, or other device inside the body of the factory method. In Java:

public class Card {
    private Suit suit;
    private Rank rank;
    private Card(Suit s; Rank r) {       // Constructor must be private
        suit = s;                        // to force all clients to use
        rank = r;                        // the factory method below
    }
    public static Card fromRankAndSuit(Rank r, Suit s) {
        return new Card(s, r);
    }
    . . .
}

Literals

In languages without classes, or when classes are not too important (say, when there is duck typing), object literals create new objects. Here's a JavaScript example. Fields are string literals, but we get to omit the quotes if the field name is a simple word that is not a reserved word:

{name: "Pantages Theater", lat: 34.1017, "long": -118.3255, 
    neighborhood: "Hollywood", "state or province": "CA",
    country: "us", "postal code": "90028", city: "Los Angeles", 
    established: new LocalDate("1930-06-04"), status: Status.ACTIVE)}

Object literals are also possible in languages with statically typed, closed classes. In Ada:

type Point is record
    X: Integer;
    Y: Integer;
end record;
. . .
var P: Point := Point'(6, 5);
var Q: Point := Point'(Y => 5, X => 6);

Construction in Prototypal Languages

Modern (ECMAScript 5-based) JavaScript has nice features for creating objects with a given prototype. I came up with this pattern:

/*
 * An immutable circle datatype.  Synopsis:
 *
 * var c = Circle.create(5, {x: 1, y: 4});
 * c.radius      => 5
 * c.center      => {x: 1, y: 4}
 * c.area()      => 25π
 * c.perimeter() => 10π
 */
var Circle = {
    prototype: {
        area: function () {return Math.PI * this.radius * this.radius;},
        perimeter: function () {return 2 * Math.PI * this.radius;}
    },

    create: function (r, c) {
        return Object.create(this.prototype, {
            center: {value: c},
            radius: {value: r}
        });
    }
};

The circles we create with Circle.create are immutable because we use JavaScript's default property descriptors for center and radius: they're not writable, not enumberable, we can't delete them, nor can we change these settings. The same isn't true for the Circle properties (prototype and create), nor for the prototype's own properties...is it worth using property descriptors on these too?