Programming Language Design Sketchbook

2011-03-05

Type Coercion

A type conversion is an operation that takes a value of one type and returns a value of another type such that the two values are somehow equivalent. Examples include:

float(5) ⇒ 5.0 // integer to float
string('s') ⇒ "s" // character to string
codepoint('A') ⇒ 0x41 // character to integer
A (Java) FileInputStream converted to a ByteArrayInputStream by first loading into memory

If no conversion operator explicitly appears in the code, the conversion is called a type coercion. Like all implicit operations, coercions can be convenient (less source code to type and read), but also dangerous (as things happen without the programmer's explicit authorization).

How can languages make conversions and coercions available to programmers? Coercions are usually dealt with in one of three ways:

Allow all corecions
Disallow all coercions
Allow only certain coercions, such as subtypes to supertypes

No Coercions Ever

Languages like Ada and ML have a very simple rule: no coercions ever. One cannot even use an integer where a floating point value is expected, without explicitly casting the integer to a float.

Implicit Upcasting

Some languages, such as Java, allow casting in only one way, from subtypes to supertypes. This is called upcasting. You can use an integer where a float is expected, and the integer value will be coerced to the right floating point value. You can't go the other way automatically, though.

Implicit Downcasting

Coercions from a supertype to a subtype (downcasting) are inherently dangerous because information is being lost. In C:

int x = 8.2;            // x gets 8
int y = 8.5;            // does y get 8 or 9?

Directly assigning a floating point value to an integer variable in its declaration is unrealistic; here's something that is more likely to occur:

void f(char* s, int n) { ... }
.
.
.
f("some text", z);   // z is a float, coerced to a int within f

Again, this may not be too bad in practice; z will be truncated or rounded and the resulting value will be passed to the parameter n. Unless z is a global variable and is assigned the value of n within f, there's no problem, right? People would never do such a thing, right?

Never say never! SAP has a product called Business Objects Data Integrator (BODI), with an embedded database programming language. In the Data Integrator Reference Guide, on page 453, they describe this function:

ifthenelse: Allows conditional logic in expressions.
Syntax
ifthenelse(condition, true_branch, false_branch)
Return value
true_branch or false_branch
Returns one of the values provided, based on the result of condition. The data type of the return value is the data type of the expression in true_branch. If the data type of false_branch is not convertible to the data type of true_branch, Data Integrator produces an error at validation. If the data types are convertible but don’t match, a warning appears at validation.

"The type of the return value is the datatype of the expression in true_branch"? That's not exactly intuitive. Combined with the fact that BODI (or the underlying database it may use) does implicit coercions from floats to integers, users are opened up to the following disaster:

P.latitude = ifthenelse(P.latitude is null, 0, P.latitude)

This is an attempt to clean up a null value by setting it to 0. But assuming the latitude is a regular floating point value, this assignment destroys the value, turning 43.2, say, into 43. If latitudes and longitudes are both "adjusted" this way, places in good neighborhoods end up in bad neighborhoods or in the middle of a river or ocean.

Coercions and Type Systems

Why would BODI even create an ifthenelse operator that would coerce an argument? This is a classic case of mixing static typing with weak typing. Note:

Static Typing means that all expressions must be fully typed at compile time, while dynamic typing means that types are resolved at run time.
Strong Typing means that operators applied to arguments of the wrong type generate errors, while weak typing means that the values will be coerced (wherever possible) to make the operation valid.

Static typing goes well with strong typing (ML, Ada), while dynamic typing can be combined with either strong (Ruby) or weak (Perl, JavaScript) typing, without too many surprises.

How should BODI's ifthenelse operator have been defined? Using static and weak typing together is not really the worst element at play here; it's the implicit downcasting of P.latitude to an integer. Why did they do this? Perhaps the designers of this language thought it would be "too complicated" to define the return type as "the most general type of true_branch and false_branch" and thought it "too restrictive" to force the two types to match exactly. They could have at least detected the fact that the false_branch had a more general type then the true_branch and generated and error there. Perhaps they thought that "too confusing." But wait, taking the type of true_branch as the result type is asymmetric and confusing on its own! How is that intuitive?

Note that with dynamic typing this problem never would have arisen since the ifthenelse expression would have produced either its second or third argument and hence taken on the type of the value returned, whatever that was. A dynamically typed language would never have even considered a coercion in that case at all! And a statically typed language with strong typing would have rejected the expression outright.

Static + weak is the problem. Language designers would be wise to avoid this combination.

2010-12-31

Dictionaries

Nearly every non-trivial application makes use of dictionaries, a.k.a. maps, associative arrays, hashes, lookup tables. What are some useful and convenient forms for these objects?

Types of Dictionaries

A look at the Java Collections and Concurrency APIs shows that there are quite a few kinds of dictionaries out there. Among them (note that these types are not mutually exclusive) are:

HashMap
SortedMap
TreeMap
LinkedHashMap
ConcurrentMap
NavigableMap
SkipListMap
IdentityHashMap
EnumMap
WeakHashMap

Literal Forms

If you've only coded in Java, C, Fortran, Pascal, Ada, or similar languages, you're missing out on the fun of writing dictionary literals. Here is a sampling of syntaxes:

{"CA":"Sacramento", "HI":"Honolulu", "OR": "Salem", "MA":"Boston"}
{CA:"Sacramento", HI:"Honolulu", OR: "Salem", MA:"Boston"}
{"CA" => "Sacramento", "HI" => "Honolulu", "OR" => "Salem", "MA" =>"Boston"}
{CA => "Sacramento", HI => "Honolulu", OR => "Salem", MA => "Boston"}
{:CA => "Sacramento", :HI => "Honolulu", :OR => "Salem", :MA => "Boston"}

The colon as a separator is probably the most popular; it appears in well-known data interchange formats like JSON and YAML. The => is called a hashrocket, because dictionaries are sometimes called hashes. (They shouldn't be called hashes in general; that term should only be used when you know that internally the dictionary uses a hash table for storage. Often you'll do fine with a search tree or a plain old array.)

Coping without Literals

Some languages do not allow literal forms for dictionaries. What can you do? In a dyanmically typed language, some kind of constructor taking a string with delimiters works pretty well:

def dictionary (s)
  d = Dictionary.new;
  pairs = s.split("|")
  for pair in pairs
    key, value = pair.split(",")
    d.put(key, value)
  end
  return d
end

states = dictionary("CA,Sacramento|HI,Honolulu|OR,Salem|MA,Boston");

Works for this one case, but in general the delimiters should be parameters...

Java programmers can combine static initializers with anonymous subclasses to approximate literals:

Map<String, String> states = new HashMap<String, String>() {
    {
        put("CA", "Sacramento");
        put("HI", "Honolulu");
        put("OR", "Salem");
        put("MA", "Boston");
    }
};

Testing Key Equality

TODO

Static Typing

TODO

Multimaps

TODO

2010-11-30

Invocation

The number of possibile ways to construct a subroutine (function, procedure, continuation, thread, task, process, etc.) invocation are enormous. There are dozens of known ways to express an invocation syntactically, and probably a dozen known semantics for argument passing. This brief article lists many current approaches, with examples, and offers some new styles.

Syntactic Invocation Patterns

There are dozens of ways to form an invocation, syntactically

Prefix, parentheses required after the routine name
- f(x, y, z)
- f()
- f(x)
- f(x, g(y, z))
- f(x, g(y), z)
- "+"(x, y)
- op+(x, y)
- operator+(x, y)

Semantics of Invocation

Semantics of Argument Passing

2010-10-31

Generators

A generator is a entity that you can call at any time to request the next value in some sequence. For simplicity, this post will focus on a particular generator: one for Fibonacci numbers.

Generators as objects

In an object-oriented language, one can create a specific generator with a property holding state and a method to return the next value. In JavaScript:

var fibonacciGenerator = { 
    a: 0, 
    b: 1, 
    next: function () { 
        var result = this.b; 
        this.b += this.a; 
        return this.a = result; 
    } 
};

Normally, you want to protect the generator state from tampering. One way to do this is to use a module that supports private variables. In Ruby:

Similarly, in a classical (as opposed to prototypal) OO language, you can make a class for your generator. Then different instances can be constructed, possibly with different starting points or different rules. Again in Ruby:

Implementing Generators with Closures

Generators in C

Language Support for Generators

2010-10-30

Anonymous Functions

A function is something you invoke, passing zero or more inputs (called arguments) and getting back zero or more outputs (called results). Functions generally have parameters (which match up with the arguments they accept), and a body, which contains the code that the function executes. What isn't necessary is that the function have a name. A function without a name is called — not surprisingly — an anonymous function.

How can we make anonymous functions easy to read and write?

Preliminaries

A few things to consider when designing support for anonymous functions:

Should the function body be a single expression or a block with a return statement or statements?
What kind of syntactic device should be used to delimit the function?
Are functions first-class values (as in Scheme, Haskell, or JavaScript) or are simple "blocks" or "continuations" sufficient?

Terse Designs

When the function body is a simple expression, there are a number of very terse ways to denote a function, many of which are used in existing languages. Here are some sketches for a function whose sole input is (an integer) x, and whose result is x/2 if x is even and 3*x+1 otherwise:

[x -> if x mod 2 = 0 then x div 2 else 3 * x + 1]
{x → x % 2 === 0 ? x / 2 : 3*x + 1}
λ x . (x / 2 if x % 2 == 0 else 3 * x + 1)
(LAMBDA (x) (IF (= (% X 2) 0) (/ X 2) (+ (* 3 X) 1)))
fn x => if x mod 2 = 0 then x / 2 else 3 * x + 1
{|x| x % 2 === 0 ? x / 2 : 3*x + 1}
function from x to x % 2 === 0 ? x / 2 : 3*x + 1 end
x to x % 2 === 0 ? x / 2 : 3*x + 1
fun (x) returning x % 2 === 0 ? x / 2 : 3*x + 1 end fun

For those forms which do not have special outer brackets to indicate a function expression, regular parentheses can be used to delineate them when they appear in the middle of a larger expression.

Verbose Designs

Function expressions that have complex bodies, such as those with several statements, including return statements can simply be written as function declarations without a name. See the JavaScript example below.

Uses of Anonymous Functions

There are two main uses of anonymous functions:

Arguments to functions like map, filter, and reduce.
To act as a module, in which local variables play the role of private variables. An example in JavaScript, in which a variable is shared between two exported global functions:
```
(function () {
    var x = 0;
    f = function () {return x -= 5;}
    g = function () {return x = 2 * Math.abs(x);}
}());
```

2010-09-30

Values and No Values

Consider something as simple as recording a property for an object, say, a person's supervisor. What kind of possibilities exist? Here are some:

I have a supervisor and her name is Alice.
I definitely have no supervisor.
I may or may not have a supervisor; I really don't know.
I do know whether I have a supervisor, but I don't care to make this information public; in other words, it's none of your business.

How can we capture these cases in a programming language?

Strings Only

For languages with unsophisticated type systems (say, strings or symbols only), you might try:

var supervisor1 = "Alice";
var supervisor2 = "None";
var supervisor3 = "Unknown";
var supervisor4 = "Private Info";

The obvious problem here is that people can have any name at all, even "None", "Nobody", "N/A", "Robert'); drop table students;--", or even "". You'd have to express in comments, and add logic to your application, that certain strings really aren't names. This isn't what we'd call a clean solution. What we want is a solution in which the values representing no (or an unknown) supervisor belong to a type or types other than plain string.

Disjoint Sum Types

ML and related languages feature a very clean syntax for defining flexible datatypes:

datatype Person = None | Unknown | WillNotSay | Actual of string;
val supervisor1 = Actual "Alice";
val supervisor2 = None;
val supervisor3 = Unknown;
val supervisor4 = WillNotSay;

This works very nicely: the type of None is Person, not string. In fact, the type of each supervisor variable above is (inferred to be) Person. While this is a nice statically-typed solution, a programmer would need to add the special constants None, Unknown, and WillNotSay to all types for which this kind of information is relevant. You can do this "once" with a polymorphic type:

datatype 'a info = None | Unknown | WillNotSay | Actual of 'a;
val supervisor1 = Actual "Alice";
val supervisor2: string info = None;
val supervisor3: string info = Unknown;
val supervisor4: string info = WillNotSay;

Here we've given explicit types to supervisor2, supervisor3, and supervisor4, since the None, Unknown, and WillNotSay constructors are polymorphic.

JavaScript's null and undefined

In a dynamically typed language like JavaScript we often have values in distinct types that represent nothingness or lack of knowlege. In JavaScript, for example:

null is the sole member of the Null type and represents the certainty of having no real value.
undefined is the sole member of the Undefined type and represents the lack of knowledge about its real value.

JavaScript doesn't natively distinguish between cases 3 and 4 — lack of knowledge and the refusal to share that knowledge. It could be argued that isn't really a common case anyway. Now you could, if you really wanted to, make this distinction somewhat like this:

var alice = {name: "Alice"; supervisor: null};
var bob = {name: "Bob"; supervisor: alice};
var eve = (name: "Eve"; supervisor: undefined};
var mallory = {name: "Mallory"}

This uses the lack of a supervisor property of Mallory to say she makes no claim to even having a supervisor, and isn't telling you if she has one at all. This approach is a little ugly in practice since, in JavaScript anyway, evaluating mallory.supervisor produces undefined! You'd have to dig into the object and examine its properties in order to pick up on the difference.

Marker Objects

Another approach that works well in a dynamically-typed language is to create the special values yourself, as simple, plain old objects. In JavaScript:

var NONE = {};
var UNKNOWN = {};
var WILL_NOT_SAY = {};

It should be fairly easy to figure out how to use these values.

This approach doesn't quite work as well in a statically-typed language. In Java, for example, marker objects would be implemented like this:

public static final Object NONE = new Object();
public static final Object UNKNOWN = new Object();
public static final Object WILL_NOT_SAY = new Object();

But, this means, of course, properties such as supervisor will have to be given the Java type Object, which goes against the whole point of using a statically typed language. ML's disjoint sum types are better for static languages.

2010-08-22

Time, Dates, and Calendars

Few programming languages have direct support for times and dates, leaving such support to libraries. This is fine. However, many languages (including Java and JavaScript) have time and date support in their standard libraries which is very poor. Java's Calendar and Date classes are particularly pathetic; JavaScript's is lame but at least fully admits its deficiencies, promising nothing. Fortunately programmers can turn to third-party libraries (including the excellent Joda-Time for Java).

Here I want to convey the basic issues in representing temporal data and operations in programming, sketch out ways to make better libraries, and look at ways to include time and date support within a language.

Basic Definitions

Many programmers have an insufficient understanding of time and dates. Here is a short, and in many cases oversimplified, overview of the main concepts:

Time

The one-dimensional continuum of experiences.

Instant

A point in time. Because time is one-dimensional, an instant can be given with a single real number, once we choose an origin and define a unit. The origin (instant 0) of the one-dimensional space is called the epoch. Common units for the description and measurement of time include the second and the millisecond.

Duration

A quantity of time, for example "17343.3422 milliseconds." Note that instants are points but durations are vectors. An instant plus a duration is an instant; An instant minus an instant is a duration.

Interval

The notion of elapsed time between two particular instants. Given an interval, you can compute its duration, but not vice versa. For example, the interval between today at noon and tomorrow at noon different from the interval between yesterday noon and today noon, even though the durations of each interval are the same.

DateTime

A description of an instant in human-friendly terms, generally, but not necessarily, based on aspects of the Earth, Sun, and Moon. DateTimes are often written with many components, e.g., 2010-08-22T17:29:41-07:00 has year, month, day, hour, minute, second, and offset components, respectively.

Calendar

A framework for describing datetimes. Also called a chronology. The chronologies natively supported by Joda-Time are:

Gregorian
Julian
GJ
ISO
Buddhist
Coptic
Ethiopic

Era

A large division of dates within a calendar. For example, the GJ calendar contains two eras: BCE and CE (also known as BC and AD).

Partial

A fragment of a datetime, leaving parts unspecified. For example: "August 1", "14:40". These are useful to describe recurring events, such as birthdays. They generally do not map to any particular instant or interval, but rather to many.

TimeZone

An optional component of a date time that allows people to have similar associations between the time portion of a datetime and the position of the sun relative to their location on Earth. Usually represented as an offset (duration), which can vary according to local law (e.g., Daylight Saving).

Period

Similar to a duration, but expressed with human-friendly terms like months, weeks, and days. A period does not have an exact number of milliseconds; for example "2 months" can have a different number of days, depending on the time of year; and even "1 day" can have 22, 23, or 24 hours when daylight time is taken into account.

A Time Library

A good time library should maintain types for each of the above concepts, as well as a rich set of parsing and formatting routines for dates, conversions between types, and operators. Joda-Time qualifies.

The ability to construct new calendars (chronologies) is also desirable. What would such a capability look like?

TODO

Native Language Support for Time

Because the concepts of time of dates are complex, language-level support for all of the concepts introduced above can become quite cumbersome. How many reserved words or special operators can a language, especially a general purpose one, absorb? How many literal forms should be added?

TODO