2010-09-30

Values and No Values

Consider something as simple as recording a property for an object, say, a person's supervisor. What kind of possibilities exist? Here are some:

  1. I have a supervisor and her name is Alice.
  2. I definitely have no supervisor.
  3. I may or may not have a supervisor; I really don't know.
  4. I do know whether I have a supervisor, but I don't care to make this information public; in other words, it's none of your business.

How can we capture these cases in a programming language?

Strings Only

For languages with unsophisticated type systems (say, strings or symbols only), you might try:

var supervisor1 = "Alice";
var supervisor2 = "None";
var supervisor3 = "Unknown";
var supervisor4 = "Private Info";

The obvious problem here is that people can have any name at all, even "None", "Nobody", "N/A", "Robert'); drop table students;--", or even "". You'd have to express in comments, and add logic to your application, that certain strings really aren't names. This isn't what we'd call a clean solution. What we want is a solution in which the values representing no (or an unknown) supervisor belong to a type or types other than plain string.

Disjoint Sum Types

ML and related languages feature a very clean syntax for defining flexible datatypes:

datatype Person = None | Unknown | WillNotSay | Actual of string;
val supervisor1 = Actual "Alice";
val supervisor2 = None;
val supervisor3 = Unknown;
val supervisor4 = WillNotSay;

This works very nicely: the type of None is Person, not string. In fact, the type of each supervisor variable above is (inferred to be) Person. While this is a nice statically-typed solution, a programmer would need to add the special constants None, Unknown, and WillNotSay to all types for which this kind of information is relevant. You can do this "once" with a polymorphic type:

datatype 'a info = None | Unknown | WillNotSay | Actual of 'a;
val supervisor1 = Actual "Alice";
val supervisor2: string info = None;
val supervisor3: string info = Unknown;
val supervisor4: string info = WillNotSay;

Here we've given explicit types to supervisor2, supervisor3, and supervisor4, since the None, Unknown, and WillNotSay constructors are polymorphic.

JavaScript's null and undefined

In a dynamically typed language like JavaScript we often have values in distinct types that represent nothingness or lack of knowlege. In JavaScript, for example:

  • null is the sole member of the Null type and represents the certainty of having no real value.
  • undefined is the sole member of the Undefined type and represents the lack of knowledge about its real value.

JavaScript doesn't natively distinguish between cases 3 and 4 — lack of knowledge and the refusal to share that knowledge. It could be argued that isn't really a common case anyway. Now you could, if you really wanted to, make this distinction somewhat like this:

var alice = {name: "Alice"; supervisor: null};
var bob = {name: "Bob"; supervisor: alice};
var eve = (name: "Eve"; supervisor: undefined};
var mallory = {name: "Mallory"}

This uses the lack of a supervisor property of Mallory to say she makes no claim to even having a supervisor, and isn't telling you if she has one at all. This approach is a little ugly in practice since, in JavaScript anyway, evaluating mallory.supervisor produces undefined! You'd have to dig into the object and examine its properties in order to pick up on the difference.

Marker Objects

Another approach that works well in a dynamically-typed language is to create the special values yourself, as simple, plain old objects. In JavaScript:

var NONE = {};
var UNKNOWN = {};
var WILL_NOT_SAY = {};

It should be fairly easy to figure out how to use these values.

This approach doesn't quite work as well in a statically-typed language. In Java, for example, marker objects would be implemented like this:

public static final Object NONE = new Object();
public static final Object UNKNOWN = new Object();
public static final Object WILL_NOT_SAY = new Object();

But, this means, of course, properties such as supervisor will have to be given the Java type Object, which goes against the whole point of using a statically typed language. ML's disjoint sum types are better for static languages.

2010-08-22

Time, Dates, and Calendars

Few programming languages have direct support for times and dates, leaving such support to libraries. This is fine. However, many languages (including Java and JavaScript) have time and date support in their standard libraries which is very poor. Java's Calendar and Date classes are particularly pathetic; JavaScript's is lame but at least fully admits its deficiencies, promising nothing. Fortunately programmers can turn to third-party libraries (including the excellent Joda-Time for Java).

Here I want to convey the basic issues in representing temporal data and operations in programming, sketch out ways to make better libraries, and look at ways to include time and date support within a language.

Basic Definitions

Many programmers have an insufficient understanding of time and dates. Here is a short, and in many cases oversimplified, overview of the main concepts:

Time
The one-dimensional continuum of experiences.
Instant
A point in time. Because time is one-dimensional, an instant can be given with a single real number, once we choose an origin and define a unit. The origin (instant 0) of the one-dimensional space is called the epoch. Common units for the description and measurement of time include the second and the millisecond.
Duration
A quantity of time, for example "17343.3422 milliseconds." Note that instants are points but durations are vectors. An instant plus a duration is an instant; An instant minus an instant is a duration.
Interval
The notion of elapsed time between two particular instants. Given an interval, you can compute its duration, but not vice versa. For example, the interval between today at noon and tomorrow at noon different from the interval between yesterday noon and today noon, even though the durations of each interval are the same.
DateTime
A description of an instant in human-friendly terms, generally, but not necessarily, based on aspects of the Earth, Sun, and Moon. DateTimes are often written with many components, e.g., 2010-08-22T17:29:41-07:00 has year, month, day, hour, minute, second, and offset components, respectively.
Calendar
A framework for describing datetimes. Also called a chronology. The chronologies natively supported by Joda-Time are:
  • Gregorian
  • Julian
  • GJ
  • ISO
  • Buddhist
  • Coptic
  • Ethiopic
There are of course, many, many others.
Era
A large division of dates within a calendar. For example, the GJ calendar contains two eras: BCE and CE (also known as BC and AD).
Partial
A fragment of a datetime, leaving parts unspecified. For example: "August 1", "14:40". These are useful to describe recurring events, such as birthdays. They generally do not map to any particular instant or interval, but rather to many.
TimeZone
An optional component of a date time that allows people to have similar associations between the time portion of a datetime and the position of the sun relative to their location on Earth. Usually represented as an offset (duration), which can vary according to local law (e.g., Daylight Saving).
Period
Similar to a duration, but expressed with human-friendly terms like months, weeks, and days. A period does not have an exact number of milliseconds; for example "2 months" can have a different number of days, depending on the time of year; and even "1 day" can have 22, 23, or 24 hours when daylight time is taken into account.

A Time Library

A good time library should maintain types for each of the above concepts, as well as a rich set of parsing and formatting routines for dates, conversions between types, and operators. Joda-Time qualifies.

The ability to construct new calendars (chronologies) is also desirable. What would such a capability look like?

TODO

Native Language Support for Time

Because the concepts of time of dates are complex, language-level support for all of the concepts introduced above can become quite cumbersome. How many reserved words or special operators can a language, especially a general purpose one, absorb? How many literal forms should be added?


TODO

2010-07-22

Default Parameters

Subroutine calls can be simplified when one doesn't have to provide arguments for every parameter.

Specifying defaults in the signature

If you're lucky, your language might allow your signature to specify defaults:

function f(x = 3, y = 5, z = 10) {
    ...
}

f();            // same as f(3, 5, 10);
f(44);          // same as f(44, 5, 10);
f(32, 1);       // same as f(32, 1, 10);
f(17, 15, 22);  

Other forms for specifying defaults:

(DEFUN f ((x 3) (y 5) (z 10)) ...)
def f(x := 3, y := 5, z := 10) ... end

Supplying defaults in the subroutine body

Some languages don't support a syntax in which defaults can be supplied in a signature, but they do let calls be made with fewer arguments than parameters. Parameters for missing arguments automatically receive a default value like nil or undefined. Whatever the value, these languages almost always treat it as falsy, allowing the use of an ||= assignment, as in:

function f(x, y, z) {
    x ||= 3;
    y ||= 5;
    z ||= 10;
    ...
}

The problem here is that other values which you might want to pass in might be treated as falsy by a language (such as 0 or the empty string), so this approach will not work in general. The idea can be implemented if the language has an operator to determine whether a value is "defined":

function f(x, y, z) {
    x = defined x ? x : 3;
    y = defined y ? y : 5;
    z = defined z ? z : 10;
    ...
}
function f(x, y, z) {
    x = 3 if not defined x;
    y = 5 if not defined x;
    z = 10 if not defined x;
    ...
}
function f(x, y, z) {
    if (!defined(x)) {
        x = 3;
    }
    if (!defined(y)) {
        y = 5;
    }
    if (!defined(z)) {
        z = 10;
    }
    ...
}

These forms aren't as pretty because of the logic required in the body as opposed to the signature.

Overloading

Some languages allow neither defaults to be specified nor calls with too few arguments. In this case you can simulate defaults with overloading. In Java:

public void f(int x, int y, int z) {...}
public void f(int x, int y) {f(x, y, 10);}
public void f(int x) {f(x, 5, 10);}
public void f() {f(3, 5, 10);}

2010-06-28

REST

REST, or representational state transfer, is a set of architectural principles used in the design of the world wide web, that have proven the key to the web's efficiency and scalability. What are these principles, and how can they be used in the design of programming languages?

Architectural Constraints

REST was described by Roy Fielding in his dissertation. He called out the following set of architectural constraints:

  • Client-Server — separates UI from data storage
  • Stateless Server — improves reliability and scalability
  • Client Cache — reduces some network traffic
  • Uniform Interface — decouples implementations from the services they provide
  • Layered System — means that each component only be concerned with those just below or just above it
  • Code-on-Demand — allows client functionality to be extended by downloading applets or scripts

Principles of REST

A RESTful system is characterized by

  • Addressable Resources
    • Every resource has a (unique) identifier (e.g., URI)
    • Example: http://www.yourcompany.com/products/2443
    • NOT http://www.yourcompany.com/getProduct?id=2443
    • Resources are separate from their representations
  • Representation-Orientation
    • Requests for resources return representations
    • The representation contains the state of the resource
    • The client can update or delete resources through their representations
  • Uniform, Constrained Interface
    • Small, fixed number of verbs
    • Example: In HTTP, GET, PUT, POST, DELETE, HEAD, OPTIONS, TRACE only!
    • An example of a non-constrained interface is an RPC-based protocol (e.g., SOAP, CORBA)
  • Self-Descriptive Messages
    • Messages include all the information necessary to be processed
    • Example: Internet Media Types
  • Stateless Server
    • With no shared context on the server, client requests are independent
    • Servers can be simpler, "more" multithreaded, easier to monitor, reliable, scalable, etc.
  • Cacheability
    • Responses indicate whether the client can cache the data or not
    • Obviously can cut down on network traffic if the client knows it doesn't have to ask for the same data again
  • HATEOAS
    • Hypermedia as the engine of application state
    • Representations should include links to related resources

Can we apply any of these principles to programming language design?

System-wide unique identifiers

The idea of identifiers having wide scope, whether unique across a system or unique across the entire world (such as URIs) exists in many forms: UUIDs, MAC addresses, public IP addresses, etc. In programming languages, Java package names are encouraged to be globally unique by using a (reversed) DNS name over which you have control.

Resource/Representation Separation

A limited, fixed, set of verbs

Self-descriptive content

Statelessness

Cachability

Linkage of Representations

2010-06-22

Eval

A common feature in many scripting languages is an eval function. Pass it a string representing source code, and the function will compile it and then interpret it. This function is called eval if the source code string represents an expression that produces a value, and might be called exec if it does not.

Eval is controversial and much-maligned because it is (1) so often abused, (2) so often used when much better alternatives exist, (3) is slow, and (4) can lead to disaster if misused. It is not rare to hear the claim "eval is evil." So the questions are: should a language allow eval and if so, how should this feature be designed?

Problems with eval

Things to watch out for:

  • Eval is slow because a compiler has to be launched to lex and parse the code string, prior to evaluating it.
  • Eval is lame because programmers may sometimes use eval without thinking. While it would be rare for anyone to blatantly pass a fixed string to eval, such as
    eval("x = 3;");
    

    one can usually get away with creating an anonymous function.

  • Eval is a security hole. Since the only reason for using eval is to run code that is supplied at run time, it's possible that this code may come from an untrusted or malicious source. Allowing just anyone to run code on your own machine is crazy.

When is eval not evil?

If you are going to use eval, the string must be completely sanitized. You need to check it for infinite loops, assignments, or calls with side-effects that might destroy the integrity of your application. For example, here is a JavaScript application that accepts an arithmetic expression from a user and evaluates it, first checking to make sure the input consists only of digits, parentheses, and the four basic arithmetic operators:

var s = prompt("Enter a numeric formula");
if (/[^\d()+*/-]/.test(s)) {
    alert("I don't trust that input");
} else {
    alert(eval(s));
}

Eval and Language Design

Either a language will allow evaluation of code strings or it will not. If it does, we can provide that functionality through a function or an operator. Because it is dangerous, it is definitely a candidate construct for being required to appear inside a "warning" construct, or similarly, disallowed from strict modes. Examples:

UNSAFE module Calcuator {
    ....
    // use eval here
    ....
}
use module UNSAFE;
application Calcuator {
    ....
    // use UNSAFE.eval here
    ....
}

Eval as a function

Generally, we see eval as a global function, or a member of a module entitled something like Kernel, System, or perhaps better, UNSAFE.

Eval as an operator

Another way to call attention to the use of eval is to simply make it a unary operator on strings. Perhaps:

var s = prompt("Enter an expression");
alert( `` s );

Alternatively, we might see the string to be evaled enclosed in angle or other brackets.

Disallowing Eval

Clearly we could imagine a language without an eval function or operator. C is one such language.

Forced Sanitization

An interesting new idea is for the eval function to take a second parameter which is (1) a block or anonymous function that performs sanitization, or (2) a regex which will be applied to the string so that the string will only be applied if the regex pattern matches. Example:

var s = prompt("Enter a numeric formula");
alert(eval(s, /[^\d()+*/-]/));

Throwing an exception on not matching would likely be the best course of action here.

2010-06-06

Regular Expressions

A regular expression (or regex, regexp, r.e.) is a pattern that matches a string or a portion of a string. They are used for validation, search, and find/replace. You'll find them in Java, JavaScript, Ruby, Python, Perl, and dozens of other popular languages. All languages agree on a common notation core, e.g. square brackets for character classes, ? for zero-or-one, * for zero or more, + for one-or-more, parentheses for grouping and capturing, and so on.

There's room for some notational innovations here, though. We'll cover a little of what's already in use and sketch some new ideas, too.

Examples from the Common Notation

Virtually all languages will interpret these the same way:

hello
gray|grey
gr(a|e)y
gr[ae]y
colou?r
go*gle
go+gle
g(oo)+gle
z{3}
z{3,6}
z{3,}
[Bb]rainf\*\*k
\d{5}(-\d{4})?
1\d{10}
[2-9]|[12]\d|3[0-6]
Hello\nworld
b..b
\d+(\.\d\d)?
^dog
dog$
^dog$
sh[^i]t
\d+(\.\d+([Ee][+-]?\d+)?)?
https?://(?:www\.)citysearch\.com/profile/[^/]+/(\d+)/?

Metacharacters and Escapes

Most characters stand for themsleves in a pattern. The ones that don't are called metacharacters. Most languages have 14 metacharacters (some more, some less). The usual ones are:

    |    ?    *    +    .    ^    $
    (    )    [    ]    {    }    \

If you want to use a metacharacter as a regular character, you have to escape it, e.g. \+, \[, \\, etc. The backslash also introduces dozens of other simplified expressions:

...
... TODO ...
...

Defining Patterns

How do you write a regex, such as \d+(\.\d+)? in program text? How about in a string literal?

'\d+(\.\d+)?'

Maybe.... The thing is most languages use \ to escape within a string literal. Perl and Ruby don't make the backslash special if the string literal has single quotes, but they do for double quotes, like most languages. In that case, you have to write:

"\\d+(\\.\\d+)?"

And if you need to write a regex that has to match a backslash character, the regex is \\, which looks like this in a string literal:

"\\\\"

But a regex is not a string: it has to be compiled into a little program that matches:

var r = re.compile("\\d+(\\.\\d+)?")
r = "\\d+(\\.\\d+)?".toregex
Pattern p = Pattern.compile("\\d+(\\.\\d+)?");

But since regexes are so important, and that backslash issue gets so annoying, we'd like a language to have a special syntax for describing regexes—something that automatically compiles once, and frees us from doubly escaping backslashes. Common notations include Ruby's %r
and slash delimiters:

%r{\d+(\.\d+)?}
/\d+(\.\d+)?/

The problem with the slash approach is that you have to escape any / characters you want in the regex itself. The %r approach has so such trouble; the language's parser can figure out which is the terminating } easily.

Matchers

To use a regex you match it against some text. This allows you to

  • Validate that the text conforms to a desired pattern
  • Find (and extract) the parts of the text that do conform
  • Replace matching portions of the text with something else

So these matches are stateful ... TODO ... Once you have a (compiled, immutable) pattern, you use it by creating a (stateful) matcher.

Character Classes

Quantifiers

Groups and Capturing

Backreferences

Boundaries

Lookaround

Modifiers

2010-05-20

List Comprehensions

A list comprehension is a compact syntactic form for a list (or array) that is built from applying an operation to certain elements of an existing list (or array). For example:

[7*x-3 | x in 1..n | x mod 3 == 1] 

means the list of all values made by computing 7 * x - 3, for each x in 1 through n where x mod 3 is 1. If n is 10, for example, the generated list is [4, 25, 46, 67]. The three parts of the comprehension are called the output (expression), the input (range) and the filter (predicate).

Common uses of comprehensions

[7*x-3 | x in 1..n | x mod 3 == 1]  // operate on (map) and filter a range 
[7*x-3 | x in a | x mod 3 == 1]     // map and filter an existing list
[7*x-3 | x in 1..n | true]          // no filtering, just map
[x | x in 1..100 | isPrime(x) ]     // just a filter, no map

Sketches

{7*x-3 | x in 1..n | x mod 3 == 1}
[7*x-3 for x in a where x mod 3 == 1]
[7*x-3 for x in a if x mod 3 == 1]              // Python
[7*x-3 for each (i in a) if (x % 3 == 1)]       // JavaScript 1.7 (uncommon)
for (i <- (1 to 10) if x % 3 == 1) yield 7*x-3  // Scala
[7*x-3, x in 1..n, x mod 3 == 1]
{7*x-3 : x ∈ 1..n ∧ x mod 3 == 1}               // Plain old Math!!
[7*x-3 : x ∈ 1..n, x mod 3 == 1]
[7*_-3 | a | _ mod 3 == 1]                      // implicit variable

Complex input sets

How do we handle the idea of two range variables for the input set? We can write two range expressions or range over a set of tuples:

[2*x+y | x in 1..n, y in 'a'..'z' | x mod 3 == 1 and isVowel(y)]
[(x, y*y) | (x, y) ∈ a × (1..10) | x < y]
[x + y for x in a for y in b]                   // Python

Life without comprehensions

If you don't have comprehensions, or a compact, symbolic list append operator, you end up writing this:
List<Integer> a = new List<Integer>();
for (int x = 1; x <= n; x++) {
    if (x % 3 == 1) {
        a.append(7 * x - 3);
    }
} 
Many programmers like that kind of coding. More verbose, but everyone can understand it.
In languages that treat for and if as suffixes to simple statements (like Perl and Ruby), we get compact forms that can be nearly as terse as comprehensions:
a = [];
for x in 1..n
  a << 7 * x - 3 if x mod 3 == 1
end
a = [];
a << 7 * x - 3 if x mod 3 == 1 for x in 1..n
a = [];
a << 7 * x - 3 for x in 1..n where x mod 3 == 1

And, of course, there are map and filter (or select as it is called in Ruby)...

(1..n).filter(x -> x mod 3 == 1).map(x -> 7 * x - 3)
(1..10).select{|x| x % 3 == 1}.map{|x| 7 * x - 3}         // Ruby
(1 to n) filter { _%3==1 } map { 7*_-3 }                  // Scala
map((x -> 7 * x - 3), filter((x -> x mod 3 == 1), (1..n)))

Try the previous examples with your favorite form of lambda expression.

Here's an interesting variation, in Ruby that uses map but not select. Instead we add a if clause to the map, which generates extra nils, which we squeeze out with compact. Not pretty, but it gets that nice little "if" in there. Fun.

(1..n).map{|x| 7 * x - 3 if x % 3 == 0}.compact