Programming Language Design Sketchbook: June 2010

2010-06-28

REST

REST, or representational state transfer, is a set of architectural principles used in the design of the World Wide Web that have proven key to the web's efficiency and scalability. What are these principles, and how can they be used in the design of programming languages?

Architectural Constraints

REST was described by Roy Fielding in his dissertation. He called out the following set of architectural constraints:

Client-Server — separates UI from data storage
Stateless Server — improves reliability and scalability
Client Cache — reduces some network traffic
Uniform Interface — decouples implementations from the services they provide
Layered System — means that each component need only be concerned with those just below or just above it
Code-on-Demand — allows client functionality to be extended by downloading applets or scripts

Principles of REST

A RESTful system is characterized by

Addressable Resources
- Every resource has a (unique) identifier (e.g., URI)
- Example: http://www.yourcompany.com/products/2443
- NOT http://www.yourcompany.com/getProduct?id=2443
- Resources are separate from their representations
Representation-Orientation
- Requests for resources return representations
- The representation contains the state of the resource
- The client can update or delete resources through their representations
Uniform, Constrained Interface
- Small, fixed number of verbs
- Example: In HTTP, GET, PUT, POST, DELETE, HEAD, OPTIONS, TRACE only!
- An example of a non-constrained interface is an RPC-based protocol (e.g., SOAP, CORBA)
Self-Descriptive Messages
- Messages include all the information necessary to be processed
- Example: Internet Media Types
Stateless Server
- With no shared context on the server, client requests are independent
- Servers can be simpler, "more" multithreaded, easier to monitor, reliable, scalable, etc.
Cacheability
- Responses indicate whether the client can cache the data or not
- Obviously can cut down on network traffic if the client knows it doesn't have to ask for the same data again
HATEOAS
- Hypermedia as the engine of application state
- Representations should include links to related resources
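
To make the addressability and uniform-interface points concrete, here is a toy in-memory sketch in JavaScript. It is not a real HTTP server; the names `store` and `handle` and the status codes chosen are assumptions made for illustration only.

```javascript
// A toy in-memory "server": resources are addressed by path and
// manipulated only through a small, fixed set of verbs.
// The names store and handle are hypothetical.
const store = new Map();

function handle(verb, path, body) {
  switch (verb) {
    case "GET":
      // Return a representation of the resource, if it exists.
      return store.has(path)
        ? { status: 200, body: store.get(path) }
        : { status: 404 };
    case "PUT":
      // The client sends a full representation of the new state.
      store.set(path, body);
      return { status: 200 };
    case "DELETE":
      return store.delete(path) ? { status: 204 } : { status: 404 };
    default:
      // Anything outside the fixed verb set is rejected.
      return { status: 405 };
  }
}

handle("PUT", "/products/2443", { name: "Widget", price: 9.99 });
console.log(handle("GET", "/products/2443").body.name);     // Widget
console.log(handle("getProduct", "/products/2443").status); // 405
```

Note that the operation never appears in the identifier: /products/2443 names the thing, and the verb says what to do with it.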

Can we apply any of these principles to programming language design?

System-wide unique identifiers

The idea of identifiers having wide scope, whether unique across a system or unique across the entire world (such as URIs), exists in many forms: UUIDs, MAC addresses, public IP addresses, etc. In programming languages, Java package names are encouraged to be globally unique by using a (reversed) DNS name over which you have control.
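
A small sketch of the reversed-DNS convention, written in JavaScript for consistency with the other examples here. The helper name `packagePrefix` is hypothetical, and the sketch ignores the extra rules Java suggests for domain components that are not valid identifiers (hyphens, leading digits, keywords).

```javascript
// Hypothetical helper: derive a Java-style package prefix from a domain name
// by reversing its dot-separated components.
function packagePrefix(domain) {
  return domain.toLowerCase().split(".").reverse().join(".");
}

console.log(packagePrefix("example.com"));      // com.example
console.log(packagePrefix("cs.Example.edu"));   // edu.example.cs
```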

Resource/Representation Separation

A limited, fixed, set of verbs

Self-descriptive content

Statelessness

Cacheability

Linkage of Representations

2010-06-22

Eval

A common feature in many scripting languages is an eval function. Pass it a string representing source code, and the function will compile it and then interpret it. This function is called eval if the source code string represents an expression that produces a value, and might be called exec if it does not.

Eval is controversial and much-maligned because it is (1) so often abused, (2) so often used when much better alternatives exist, (3) slow, and (4) liable to lead to disaster if misused. It is not rare to hear the claim "eval is evil." So the questions are: should a language allow eval and, if so, how should this feature be designed?

Problems with eval

Things to watch out for:

Eval is slow because a compiler has to be launched to lex and parse the code string, prior to evaluating it.
Eval is lame because programmers sometimes reach for it without thinking. It would be rare for anyone to blatantly pass a fixed string to eval, such as
```
eval("x = 3;");
```
but even in less obvious cases one can usually get the same effect by creating an anonymous function.
Eval is a security hole. Since the only reason for using eval is to run code that is supplied at run time, it's possible that this code may come from an untrusted or malicious source. Allowing just anyone to run code on your own machine is crazy.
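
As a rough illustration of the "better alternatives" point, here are two common places where eval gets used and what can replace it. All names (`settings`, `getWithEval`, `getWithLookup`, `later`) are made up for this sketch.

```javascript
const settings = { width: 640, height: 480 };

// Tempting but unnecessary: build source text, then evaluate it.
function getWithEval(name) {
  return eval("settings." + name);
}

// A computed property lookup does the same job without invoking a compiler.
function getWithLookup(name) {
  return settings[name];
}

// Deferred computation: accept a function instead of a code string.
function later(action) {
  return action();
}

console.log(getWithEval("width"));   // 640
console.log(getWithLookup("width")); // 640
console.log(later(() => 3 * 4));     // 12
```

The lookup version also closes the security hole: a hostile `name` can at worst select a property, not run arbitrary code.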

When is eval not evil?

If you are going to use eval, the string must be completely sanitized. You need to check it for infinite loops, assignments, or calls with side-effects that might destroy the integrity of your application. For example, here is a JavaScript application that accepts an arithmetic expression from a user and evaluates it, first checking to make sure the input consists only of digits, parentheses, and the four basic arithmetic operators:

var s = prompt("Enter a numeric formula");
if (/[^\d()+*\/-]/.test(s)) {
    alert("I don't trust that input");
} else {
    alert(eval(s));
}

Eval and Language Design

Either a language will allow evaluation of code strings or it will not. If it does, we can provide that functionality through a function or an operator. Because it is dangerous, it is a good candidate for being allowed only inside a "warning" construct or, similarly, for being disallowed in strict modes. Examples:

UNSAFE module Calculator {
    ....
    // use eval here
    ....
}

use module UNSAFE;
application Calculator {
    ....
    // use UNSAFE.eval here
    ....
}

Eval as a function

Generally, we see eval as a global function, or a member of a module entitled something like Kernel, System, or perhaps better, UNSAFE.

Eval as an operator

Another way to call attention to the use of eval is to simply make it a unary operator on strings. Perhaps:

var s = prompt("Enter an expression");
alert( `` s );

Alternatively, we might see the string to be evaled enclosed in angle or other brackets.

Disallowing Eval

Clearly we could imagine a language without an eval function or operator. C is one such language.

Forced Sanitization

An interesting new idea is for the eval function to take a second parameter which is (1) a block or anonymous function that performs sanitization, or (2) a regex which will be applied to the string so that the string will only be evaluated if the regex pattern matches. Example:

var s = prompt("Enter a numeric formula");
alert(eval(s, /^[\d()+*\/-]+$/));

Throwing an exception on not matching would likely be the best course of action here.
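
JavaScript's real eval ignores extra arguments, so the two-parameter form can only be approximated with a wrapper. Here is one possible sketch; `checkedEval` is a hypothetical name, and the pattern is assumed to describe the whole of an acceptable string (hence the ^ and $ anchors).

```javascript
// Hypothetical checkedEval: evaluates source only if pattern matches it,
// and throws otherwise.
function checkedEval(source, pattern) {
  if (typeof source !== "string" || !pattern.test(source)) {
    throw new Error("Input rejected by sanitizer");
  }
  // Indirect eval runs in global scope, so the caller's locals are not visible.
  return (0, eval)(source);
}

const arithmetic = /^[\d()+*\/-]+$/;

console.log(checkedEval("2*(3+4)", arithmetic)); // 14

try {
  checkedEval("alert(1)", arithmetic);
} catch (e) {
  console.log(e.message); // Input rejected by sanitizer
}
```

A string can pass the pattern and still fail to parse (for example "(("), in which case eval itself throws a SyntaxError, so callers need to be ready for both kinds of exception.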

2010-06-06

Regular Expressions

A regular expression (or regex, regexp, r.e.) is a pattern that matches a string or a portion of a string. Regexes are used for validation, search, and find/replace. You'll find them in Java, JavaScript, Ruby, Python, Perl, and dozens of other popular languages. Nearly all of these languages agree on a common notation core, e.g. square brackets for character classes, ? for zero-or-one, * for zero-or-more, + for one-or-more, parentheses for grouping and capturing, and so on.

There's room for some notational innovations here, though. We'll cover a little of what's already in use and sketch some new ideas, too.

Examples from the Common Notation

Virtually all languages will interpret these the same way:

hello
gray|grey
gr(a|e)y
gr[ae]y
colou?r
go*gle
go+gle
g(oo)+gle
z{3}
z{3,6}
z{3,}
[Bb]rainf\*\*k
\d{5}(-\d{4})?
1\d{10}
[2-9]|[12]\d|3[0-6]
Hello\nworld
b..b
\d+(\.\d\d)?
^dog
dog$
^dog$
sh[^i]t
\d+(\.\d+([Ee][+-]?\d+)?)?
https?://(?:www\.)?citysearch\.com/profile/[^/]+/(\d+)/?
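
To see a few of these in action, here is a quick JavaScript check of some of the patterns above against sample strings. The sample strings are made up, and ^ and $ anchors have been added where the intent is to validate a whole string rather than find a match somewhere inside it.

```javascript
// Each entry: [pattern, sample text, whether we expect a match].
const samples = [
  [/gr[ae]y/, "grey", true],
  [/colou?r/, "color", true],
  [/go+gle/, "ggle", false],          // + needs at least one o
  [/g(oo)+gle/, "google", true],
  [/^z{3,6}$/, "zz", false],          // too few z's
  [/^\d{5}(-\d{4})?$/, "90045-2659", true],
  [/^dog$/, "hotdog", false],         // anchored at both ends
  [/sh[^i]t/, "shot", true],
];

for (const [pattern, text, expected] of samples) {
  const actual = pattern.test(text);
  console.log(pattern.source, JSON.stringify(text), actual, actual === expected ? "ok" : "UNEXPECTED");
}
```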

Metacharacters and Escapes

Most characters stand for themselves in a pattern. The ones that don't are called metacharacters. Most languages have 14 metacharacters (some more, some fewer). The usual ones are:

    |    ?    *    +    .    ^    $
    (    )    [    ]    {    }    \

If you want to use a metacharacter as a regular character, you have to escape it, e.g. \+, \[, \\, etc. The backslash also introduces dozens of other simplified expressions:

...
... TODO ...
...
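
Escaping metacharacters by hand gets tedious when the text to match comes from somewhere else, so a helper is handy. This is a sketch in JavaScript; `escapeRegex` is a hypothetical name, and it covers exactly the 14 metacharacters listed above.

```javascript
// Hypothetical helper: put a backslash in front of each of the 14
// metacharacters so the text is matched literally.
function escapeRegex(text) {
  return text.replace(/[|?*+.^$()[\]{}\\]/g, "\\$&");
}

console.log(escapeRegex("1+1=2?")); // 1\+1=2\?

// The escaped text can then be embedded safely in a larger pattern.
const literal = new RegExp("^" + escapeRegex("a.b") + "$");
console.log(literal.test("a.b")); // true
console.log(literal.test("axb")); // false
```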

Defining Patterns

How do you write a regex, such as \d+(\.\d+)? in program text? How about in a string literal?

'\d+(\.\d+)?'

Maybe.... The thing is, most languages use \ to escape within a string literal. Perl and Ruby leave the backslash alone in single-quoted string literals (except before a quote or another backslash), but in double-quoted literals they treat it as an escape character, like most languages. With double quotes, you have to write:

"\\d+(\\.\\d+)?"

And if you need to write a regex that has to match a backslash character, the regex is \\, which looks like this in a string literal:

"\\\\"

But a regex is not a string: it has to be compiled into a little program that matches:

var r = re.compile("\\d+(\\.\\d+)?")
r = "\\d+(\\.\\d+)?".toregex
Pattern p = Pattern.compile("\\d+(\\.\\d+)?");
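
The same double-escaping shows up in JavaScript when a regex is built from a string with the RegExp constructor. A short sketch (the ^ and $ anchors are added here so the tests validate whole strings):

```javascript
// Compiling from a string literal: every regex backslash must be doubled.
const decimal = new RegExp("^\\d+(\\.\\d+)?$");
console.log(decimal.test("3.14")); // true
console.log(decimal.test("3."));   // false

// A regex that matches one backslash is \\, so the string literal needs four.
const backslash = new RegExp("\\\\");
console.log(backslash.source.length);    // 2
console.log(backslash.test("C:\\temp")); // true
```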

But since regexes are so important, and that backslash issue gets so annoying, we'd like a language to have a special syntax for describing regexes, something that automatically compiles once and frees us from doubly escaping backslashes. Common notations include Ruby's %r and slash delimiters:

%r{\d+(\.\d+)?}
/\d+(\.\d+)?/

The problem with the slash approach is that you have to escape any / characters you want in the regex itself. The %r approach has no such trouble; provided the braces inside the pattern are balanced, the language's parser can easily figure out which } is the terminating one.
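
JavaScript has only the slash form, so the trade-off is easy to see there. In this sketch the URL pattern and the sample URLs are invented for illustration.

```javascript
// Literal syntax: no double escaping of backslashes, but each / that is
// outside a character class must be written as \/.
const fromLiteral = /https?:\/\/[^\/]+\/profile/;

// String syntax: slashes are fine as-is, but any backslash would need doubling.
const fromString = new RegExp("https?://[^/]+/profile");

const url = "http://example.com/profile";
console.log(fromLiteral.test(url)); // true
console.log(fromString.test(url));  // true
```

Which form is less painful depends on whether the pattern is heavier on slashes or on backslashes, which is part of the appeal of a notation with selectable delimiters.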

Matchers

To use a regex you match it against some text. This allows you to

Validate that the text conforms to a desired pattern
Find (and extract) the parts of the text that do conform
Replace matching portions of the text with something else

So these matches are stateful ... TODO ... Once you have a (compiled, immutable) pattern, you use it by creating a (stateful) matcher.
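
A sketch of the three uses in JavaScript. JavaScript does not split pattern and matcher into two objects the way Java does; instead, a regex with the g flag carries the matching state itself in its lastIndex property, which is enough to show the statefulness. The sample text is made up.

```javascript
const text = "Order 66 shipped 3 items for 19.99";

// Validate: does an entire string conform? (anchored version of the pattern)
console.log(/^\d+(\.\d+)?$/.test("19.99")); // true

// Find: with the g flag, each exec call resumes where the last match ended.
const number = /\d+(\.\d+)?/g;
const found = [];
let m;
while ((m = number.exec(text)) !== null) {
  found.push({ match: m[0], index: m.index });
}
console.log(found.map(f => f.match).join(", ")); // 66, 3, 19.99

// Replace: substitute every matching portion.
console.log(text.replace(/\d+(\.\d+)?/g, "#")); // Order # shipped # items for #
```

Because the state lives in the regex object, sharing one g-flagged regex between unrelated searches can produce surprising results, which is one argument for the separate pattern and matcher objects described above.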
