tag:blogger.com,1999:blog-37907365197344723862024-03-08T02:00:55.350-08:00Programming Language Design Sketchbook(f)->((g)->g(g))(g)->(n)->f(g(g))(n)Ray Toalhttp://www.blogger.com/profile/11678817534637339600noreply@blogger.comBlogger21125tag:blogger.com,1999:blog-3790736519734472386.post-23018750968358665002011-03-05T14:08:00.000-08:002012-01-25T23:08:51.456-08:00Type Coercion<p>A type <b>conversion</b> is an operation that takes a value of one type and returns a value of another type such that the two values are somehow equivalent. Examples include:</p>
<ul>
<li><code>float(5)</code> ⇒ <code>5.0</code> // integer to float
<li><code>string('s')</code> ⇒ <code>"s"</code> // character to string
<li><code>codepoint('A')</code> ⇒ <code>0x41</code> // character to integer
<li>A (Java) <code>FileInputStream</code> converted to a <code>ByteArrayInputStream</code> by first loading into memory
</ul>
<p>If no conversion operator explicitly appears in the code, the conversion is called a <b>type coercion</b>. Like all implicit operations, coercions can be convenient (less source code to type and read), but also dangerous (as things happen without the programmer's explicit authorization).</p>
<p>How can languages make conversions and coercions available to programmers? Coercions are usually dealt with in one of three ways:</p>
<ul>
<li>Allow all corecions
<li>Disallow all coercions
<li>Allow only certain coercions, such as subtypes to supertypes
</ul>
<h4>No Coercions Ever</h4>
<p>Languages like Ada and ML have a very simple rule: no coercions ever. One cannot even use an integer where a floating point value is expected, without explicitly casting the integer to a float.</p>
<h4>Implicit Upcasting</h4>
<p>Some languages, such as Java, allow casting in only one way, from subtypes to supertypes. This is called <strong>upcasting</strong>. You can use an integer where a float is expected, and the integer value will be coerced to the right floating point value. You can't go the other way automatically, though.</p>
<h4>Implicit Downcasting</h4>
<p>Coercions from a supertype to a subtype (<strong>downcasting</strong>) are inherently dangerous because information is being lost. In C:</p>
<pre class="code">
int x = 8.2; // x gets 8
int y = 8.5; // does y get 8 or 9?
</pre>
<p>Directly assigning a floating point value to an integer variable in its declaration is unrealistic; here's something that is more likely to occur:</p>
<pre class="code">
void f(char* s, int n) { ... }
.
.
.
f("some text", z); // z is a float, coerced to a int within f
</pre>
<p>Again, this may not be <em>too</em> bad in practice; z will be truncated or rounded and the resulting value will be passed to the parameter n. Unless z is a global variable and is assigned the value of n within f, there's no problem, right? People would never do such a thing, right?</p>
<p>Never say never! SAP has a product called Business Objects Data Integrator (BODI), with an embedded database programming language. In the <a href="http://help.sap.com/businessobject/product_guides/xir2acc/en/DIReferenceGuide.pdf">Data Integrator Reference Guide</a>, on page 453, they describe this function:</p>
<blockquote>
<b>ifthenelse</b>: <i>Allows conditional logic in expressions.</i><br>
<div style="margin-left: 2em;">
Syntax<br>
<pre>ifthenelse(condition, true_branch, false_branch)</pre>
Return value<br>
<pre>true_branch or false_branch</pre>
<em>Returns one of the values provided, based on the result of condition. The data type of the return value is the data type of the expression in true_branch. If the data type of false_branch is not convertible to the data type of true_branch, Data Integrator produces an error at validation. If the data types are convertible but don’t match, a warning appears at validation.</em></div>
</blockquote>
<p>"The type of the return value is the datatype of the expression in true_branch"? That's not exactly intuitive. Combined with the fact that BODI (or the underlying database it may use) does implicit coercions from floats to integers, users are opened up to the following disaster:</p>
<pre class="code">
P.latitude = ifthenelse(P.latitude is null, 0, P.latitude)
</pre>
<p>This is an attempt to clean up a null value by setting it to 0. But assuming the latitude is a regular floating point value, this assignment destroys the value, turning 43.2, say, into 43. If latitudes and longitudes are both "adjusted" this way, places in good neighborhoods end up in bad neighborhoods or in the middle of a river or ocean.</p>
<h4>Coercions and Type Systems</h4>
<p>Why would BODI even create an <code>ifthenelse</code> operator that would coerce an argument? This is a classic case of <em>mixing static typing with weak typing</em>. Note:</p>
<ul>
<li><b>Static Typing</b> means that all expressions must be fully typed at compile time, while <strong>dynamic typing</strong> means that types are resolved at run time.
<li><b>Strong Typing</b> means that operators applied to arguments of the wrong type generate errors, while <strong>weak typing</strong> means that the values will be coerced (wherever possible) to make the operation valid.
</ul>
<p>Static typing goes well with strong typing (ML, Ada), while dynamic typing can be combined with either strong (Ruby) or weak (Perl, JavaScript) typing, without too many surprises.</p>
<p>How should BODI's <code>ifthenelse</code> operator have been defined? Using static and weak typing together is not really <em>the</em> worst element at play here; it's the implicit downcasting of P.latitude to an integer. Why did they do this? Perhaps the designers of this language thought it would be "too complicated" to define the return type as "the most general type of true_branch and false_branch" and thought it "too restrictive" to force the two types to match exactly. They could have at least detected the fact that the false_branch had a more general type then the true_branch and generated and error there. Perhaps they thought that "too confusing." But wait, taking the type of <code>true_branch</code> as the result type is asymmetric and confusing on its own! How is that intuitive?</p>
<p>Note that with dynamic typing this problem never would have arisen since the ifthenelse expression would have produced either its second or third argument and hence taken on the type of the value returned, whatever that was. A dynamically typed language would never have even considered a coercion in that case at all! And a statically typed language with strong typing would have rejected the expression outright.</p>
<p>Static + weak is the problem. Language designers would be wise to avoid this combination.</p>Ray Toalhttp://www.blogger.com/profile/11678817534637339600noreply@blogger.com1tag:blogger.com,1999:blog-3790736519734472386.post-41801780777529488762010-12-31T00:04:00.000-08:002011-02-06T14:01:28.216-08:00Dictionaries<p>Nearly every non-trivial application makes use of dictionaries, a.k.a. maps, associative arrays, hashes, lookup tables. What are some useful and convenient forms for these objects?</p><h4>Types of Dictionaries</h4><p>A look at the Java Collections and Concurrency APIs shows that there are quite a few kinds of dictionaries out there. Among them (note that these types are not mutually exclusive) are:</p><ul><li>HashMap</li>
<li>SortedMap</li>
<li>TreeMap</li>
<li>LinkedHashMap</li>
<li>ConcurrentMap</li>
<li>NavigableMap</li>
<li>SkipListMap</li>
<li>IdentityHashMap</li>
<li>EnumMap</li>
<li>WeakHashMap</li>
</ul><h4>Literal Forms</h4>If you've only coded in Java, C, Fortran, Pascal, Ada, or similar languages, you're missing out on the fun of writing dictionary literals. Here is a sampling of syntaxes:<br />
<pre class="code">{"CA":"Sacramento", "HI":"Honolulu", "OR": "Salem", "MA":"Boston"}
{CA:"Sacramento", HI:"Honolulu", OR: "Salem", MA:"Boston"}
{"CA" => "Sacramento", "HI" => "Honolulu", "OR" => "Salem", "MA" =>"Boston"}
{CA => "Sacramento", HI => "Honolulu", OR => "Salem", MA => "Boston"}
{:CA => "Sacramento", :HI => "Honolulu", :OR => "Salem", :MA => "Boston"}
</pre><p>The colon as a separator is probably the most popular; it appears in well-known data interchange formats like JSON and YAML. The <tt>=></tt> is called a <em>hashrocket</em>, because dictionaries are sometimes called hashes. (They shouldn't be called hashes in general; that term should only be used when you know that internally the dictionary uses a hash table for storage. Often you'll do fine with a search tree or a plain old array.)<br />
<h4>Coping without Literals</h4><p>Some languages do not allow literal forms for dictionaries. What can you do? In a dyanmically typed language, some kind of constructor taking a string with delimiters works pretty well:</p><pre class="code">def dictionary (s)
d = Dictionary.new;
pairs = s.split("|")
for pair in pairs
key, value = pair.split(",")
d.put(key, value)
end
return d
end
states = dictionary("CA,Sacramento|HI,Honolulu|OR,Salem|MA,Boston");
</pre><p>Works for this one case, but in general the delimiters should be parameters...</p><br />
<p>Java programmers can combine <em>static initializers</em> with <em>anonymous subclasses</em> to approximate literals:</p><pre class="code">Map<String, String> states = new HashMap<String, String>() {
{
put("CA", "Sacramento");
put("HI", "Honolulu");
put("OR", "Salem");
put("MA", "Boston");
}
};
</pre><h4>Testing Key Equality</h4>TODO<br />
<h4>Static Typing</h4>TODO<br />
<h4>Multimaps</h4>TODORay Toalhttp://www.blogger.com/profile/11678817534637339600noreply@blogger.com1tag:blogger.com,1999:blog-3790736519734472386.post-74261480799476952552010-11-30T17:17:00.000-08:002010-11-30T17:17:31.995-08:00InvocationThe number of possibile ways to construct a subroutine (function, procedure, continuation, thread, task, process, etc.) invocation are enormous. There are dozens of known ways to express an invocation syntactically, and probably a dozen known semantics for argument passing. This brief article lists many current approaches, with examples, and offers some new styles.<br />
<h4>Syntactic Invocation Patterns</h4><p>There are dozens of ways to form an invocation, syntactically</p><ul><li>Prefix, parentheses required after the routine name<br />
<ul> <li><tt>f(x, y, z)</tt><br />
<li><tt>f()</tt><br />
<li><tt>f(x)</tt><br />
<li><tt>f(x, g(y, z))</tt><br />
<li><tt>f(x, g(y), z)</tt><br />
<li><tt>"+"(x, y)</tt><br />
<li><tt>op+(x, y)</tt><br />
<li><tt>operator+(x, y)</tt><br />
</ul><li><br />
</ul>
<h4>Semantics of Invocation</h4>
<h4>Semantics of Argument Passing</h4>Ray Toalhttp://www.blogger.com/profile/11678817534637339600noreply@blogger.com0tag:blogger.com,1999:blog-3790736519734472386.post-38517878997021027392010-10-31T01:09:00.000-07:002010-11-01T20:33:14.414-07:00Generators<p>A <strong>generator</strong> is a entity that you can call at any time to request the next value in some sequence. For simplicity, this post will focus on a particular generator: one for Fibonacci numbers.</p><h4>Generators as objects</h4><p>In an object-oriented language, one can create a specific generator with a property holding state and a method to return the next value. In JavaScript:</p><pre class="code">var fibonacciGenerator = {
a: 0,
b: 1,
next: function () {
var result = this.b;
this.b += this.a;
return this.a = result;
}
};
</pre><p>Normally, you want to protect the generator state from tampering. One way to do this is to use a module that supports private variables. In Ruby:<br />
<pre class="code"></pre><p>Similarly, in a classical (as opposed to prototypal) OO language, you can make a class for your generator. Then different instances can be constructed, possibly with different starting points or different rules. Again in Ruby:<br />
<pre class="code"></pre><h4>Implementing Generators with Closures</h4><h4>Generators in C</h4><h4>Language Support for Generators</h4>Ray Toalhttp://www.blogger.com/profile/11678817534637339600noreply@blogger.com0tag:blogger.com,1999:blog-3790736519734472386.post-36863227560595482102010-10-30T23:02:00.000-07:002010-11-30T17:57:45.846-08:00Anonymous Functions<p>A function is something you invoke, passing zero or more inputs (called arguments) and getting back zero or more outputs (called results). Functions generally have parameters (which match up with the arguments they accept), and a body, which contains the code that the function executes. What isn't necessary is that the function have a name. A function without a name is called — not surprisingly — an <strong>anonymous function</strong>.</p><p>How can we make anonymous functions easy to read and write?</p><h4>Preliminaries</h4><p>A few things to consider when designing support for anonymous functions:</p><ul><li>Should the function body be a single expression or a block with a return statement or statements?</li>
<li>What kind of syntactic device should be used to delimit the function?</li>
<li>Are functions first-class values (as in Scheme, Haskell, or JavaScript) or are simple "blocks" or "continuations" sufficient?</li>
</ul><h4>Terse Designs</h4><p>When the function body is a simple expression, there are a number of very terse ways to denote a function, many of which are used in existing languages. Here are some sketches for a function whose sole input is (an integer) x, and whose result is x/2 if x is even and 3*x+1 otherwise:</p><pre class="code">[x -> if x mod 2 = 0 then x div 2 else 3 * x + 1]
{x → x % 2 === 0 ? x / 2 : 3*x + 1}
λ x . (x / 2 if x % 2 == 0 else 3 * x + 1)
(LAMBDA (x) (IF (= (% X 2) 0) (/ X 2) (+ (* 3 X) 1)))
fn x => if x mod 2 = 0 then x / 2 else 3 * x + 1
{|x| x % 2 === 0 ? x / 2 : 3*x + 1}
function from x to x % 2 === 0 ? x / 2 : 3*x + 1 end
x to x % 2 === 0 ? x / 2 : 3*x + 1
fun (x) returning x % 2 === 0 ? x / 2 : 3*x + 1 end fun
</pre><p>For those forms which do not have special outer brackets to indicate a function expression, regular parentheses can be used to delineate them when they appear in the middle of a larger expression.</p><h4>Verbose Designs</h4><p>Function expressions that have complex bodies, such as those with several statements, including return statements can simply be written as function declarations without a name. See the JavaScript example below.</p><h4>Uses of Anonymous Functions</h4><p>There are two main uses of anonymous functions:</p><ul><li>Arguments to functions like map, filter, and reduce.<br />
</li>
<li>To act as a module, in which local variables play the role of private variables. An example in JavaScript, in which a variable is shared between two exported global functions:<br />
<pre class="code">(function () {
var x = 0;
f = function () {return x -= 5;}
g = function () {return x = 2 * Math.abs(x);}
}());
</pre></li>
</ul>Ray Toalhttp://www.blogger.com/profile/11678817534637339600noreply@blogger.com0tag:blogger.com,1999:blog-3790736519734472386.post-77902522976620710272010-09-30T23:14:00.000-07:002011-03-05T18:35:42.814-08:00Values and No Values<p>Consider something as simple as recording a property for an object, say, a person's supervisor. What kind of possibilities exist? Here are some:</p><ol><li>I have a supervisor and her name is Alice.</li>
<li>I definitely have no supervisor.</li>
<li>I may or may not have a supervisor; I really don't know.</li>
<li>I do know whether I have a supervisor, but I don't care to make this information public; in other words, it's none of your business.</li>
</ol><p>How can we capture these cases in a programming language?</p><h4>Strings Only</h4><p>For languages with unsophisticated type systems (say, strings or symbols only), you might try:</p><pre class="code">var supervisor1 = "Alice";
var supervisor2 = "None";
var supervisor3 = "Unknown";
var supervisor4 = "Private Info";
</pre><p>The obvious problem here is that people can have any name at all, even "None", "Nobody", "N/A", <a href='http://xkcd.com/327/'>"Robert'); drop table students;--"</a>, or even "". You'd have to express in comments, and add logic to your application, that certain strings really aren't names. This isn't what we'd call a <em>clean</em> solution. What we want is a solution in which the values representing no (or an unknown) supervisor belong to a type or types other than plain string.</p><h4>Disjoint Sum Types</h4><p>ML and related languages feature a very clean syntax for defining flexible datatypes:</p><pre class="code">datatype Person = None | Unknown | WillNotSay | Actual of string;
val supervisor1 = Actual "Alice";
val supervisor2 = None;
val supervisor3 = Unknown;
val supervisor4 = WillNotSay;
</pre><p>This works very nicely: the type of None is Person, not string. In fact, the type of each supervisor variable above is (inferred to be) Person. While this is a nice statically-typed solution, a programmer would need to add the special constants None, Unknown, and WillNotSay to all types for which this kind of information is relevant. You can do this "once" with a polymorphic type:</p><pre class="code">datatype 'a info = None | Unknown | WillNotSay | Actual of 'a;
val supervisor1 = Actual "Alice";
val supervisor2: string info = None;
val supervisor3: string info = Unknown;
val supervisor4: string info = WillNotSay;
</pre><p>Here we've given explicit types to supervisor2, supervisor3, and supervisor4, since the None, Unknown, and WillNotSay constructors are polymorphic.</p><h4>JavaScript's null and undefined</h4><p>In a dynamically typed language like JavaScript we often have values in distinct types that represent nothingness or lack of knowlege. In JavaScript, for example:<br />
<ul><li><tt>null</tt> is the sole member of the Null type and represents the certainty of having no real value.</li>
<li><tt>undefined</tt> is the sole member of the Undefined type and represents the lack of knowledge about its real value.</li>
</ul><p>JavaScript doesn't natively distinguish between cases 3 and 4 — lack of knowledge and the refusal to share that knowledge. It could be argued that isn't really a common case anyway. Now you <em>could</em>, if you really wanted to, make this distinction somewhat like this:</p><pre class="code">var alice = {name: "Alice"; supervisor: null};
var bob = {name: "Bob"; supervisor: alice};
var eve = (name: "Eve"; supervisor: undefined};
var mallory = {name: "Mallory"}
</pre><p>This uses the lack of a supervisor property of Mallory to say she makes no claim to even having a supervisor, and isn't telling you if she has one at all. This approach is a little ugly in practice since, in JavaScript anyway, evaluating <tt>mallory.supervisor</tt> produces <tt>undefined</tt>! You'd have to dig into the object and examine its properties in order to pick up on the difference.</p><h4>Marker Objects</h4><p>Another approach that works well in a dynamically-typed language is to create the special values yourself, as simple, plain old objects. In JavaScript:</p><pre class="code">var NONE = {};
var UNKNOWN = {};
var WILL_NOT_SAY = {};
</pre><p>It should be fairly easy to figure out how to use these values.</p><p>This approach doesn't quite work as well in a statically-typed language. In Java, for example, marker objects would be implemented like this:<br />
<pre class="code">public static final Object NONE = new Object();
public static final Object UNKNOWN = new Object();
public static final Object WILL_NOT_SAY = new Object();
</pre><p>But, this means, of course, properties such as supervisor will have to be given the Java type <tt>Object</tt>, which goes against the whole point of using a statically typed language. ML's disjoint sum types are better for static languages.</p>Ray Toalhttp://www.blogger.com/profile/11678817534637339600noreply@blogger.com0tag:blogger.com,1999:blog-3790736519734472386.post-13201708710714422082010-08-22T11:02:00.000-07:002010-09-05T08:50:17.660-07:00Time, Dates, and Calendars<p>Few programming languages have direct support for times and dates, leaving such support to libraries. This is fine. However, many languages (including Java and JavaScript) have time and date support in their standard libraries which is very poor. Java's Calendar and Date classes are <a href="http://stackoverflow.com/questions/1571265/why-is-the-java-date-api-java-util-date-calendar-such-a-mess">particularly pathetic</a>; JavaScript's is lame but at least fully admits its deficiencies, promising nothing. Fortunately programmers can turn to third-party libraries (including the excellent <a href="http://joda-time.sourceforge.net/">Joda-Time</a> for Java).</p><p>Here I want to convey the basic issues in representing temporal data and operations in programming, sketch out ways to make better libraries, and look at ways to include time and date support within a language.</p><h4>Basic Definitions</h4><p>Many programmers have an insufficient understanding of time and dates. Here is a short, and in many cases oversimplified, overview of the main concepts:</p><dl><dt>Time</dt>
<dd>The one-dimensional continuum of experiences.</dd>
<dt>Instant</dt>
<dd>A point in time. Because time is one-dimensional, an instant can be given with a single real number, once we choose an origin and define a unit. The origin (instant 0) of the one-dimensional space is called the <strong>epoch</strong>. Common units for the description and measurement of time include the <a href="http://en.wikipedia.org/wiki/Second">second</a> and the <a href="http://en.wikipedia.org/wiki/Millisecond">millisecond</a>.</dd>
<dt>Duration</dt>
<dd>A quantity of time, for example "17343.3422 milliseconds." Note that instants are points but durations are vectors. An instant plus a duration is an instant; An instant minus an instant is a duration.
<dt>Interval</dt>
<dd>The notion of elapsed time between two particular instants. Given an interval, you can compute its duration, but not vice versa. For example, the interval between today at noon and tomorrow at noon different from the interval between yesterday noon and today noon, even though the durations of each interval are the same.</dd>
<dt>DateTime</dt>
<dd>A description of an instant in human-friendly terms, generally, but not necessarily, based on aspects of the Earth, Sun, and Moon. DateTimes are often written with many components, e.g., <tt>2010-08-22T17:29:41-07:00</tt> has year, month, day, hour, minute, second, and offset components, respectively.</dd>
<dt>Calendar</dt>
<dd>A framework for describing datetimes. Also called a <strong>chronology</strong>. The chronologies natively supported by Joda-Time are: <ul><li>Gregorian</li>
<li>Julian</li>
<li>GJ</li>
<li>ISO</li>
<li>Buddhist</li>
<li>Coptic</li>
<li>Ethiopic</li>
</ul></dd> There are of course, many, many others.
<dt>Era</dt>
<dd>A large division of dates within a calendar. For example, the GJ calendar contains two eras: BCE and CE (also known as BC and AD).</dd>
<dt>Partial</dt>
<dd>A fragment of a datetime, leaving parts unspecified. For example: "August 1", "14:40". These are useful to describe recurring events, such as birthdays. They generally do not map to any particular instant or interval, but rather to many.</dd>
<dt>TimeZone</dt>
<dd>An optional component of a date time that allows people to have similar associations between the time portion of a datetime and the position of the sun relative to their location on Earth. Usually represented as an offset (duration), which can vary according to local law (e.g., Daylight Saving).</dd>
<dt>Period</dt>
<dd>Similar to a duration, but expressed with human-friendly terms like months, weeks, and days. A period does not have an exact number of milliseconds; for example "2 months" can have a different number of days, depending on the time of year; and even "1 day" can have 22, 23, or 24 hours when daylight time is taken into account.</dd> </dl><h4>A Time Library</h4><p>A good time library should maintain types for each of the above concepts, as well as a rich set of parsing and formatting routines for dates, conversions between types, and operators. Joda-Time qualifies.</p><p>The ability to construct new calendars (chronologies) is also desirable. What would such a capability look like?</p>TODO<br />
<h4>Native Language Support for Time</h4><p>Because the concepts of time of dates are complex, language-level support for all of the concepts introduced above can become quite cumbersome. How many reserved words or special operators can a language, especially a general purpose one, absorb? How many literal forms should be added?</p><br />
TODORay Toalhttp://www.blogger.com/profile/11678817534637339600noreply@blogger.com0tag:blogger.com,1999:blog-3790736519734472386.post-25688479777691917072010-07-22T18:38:00.000-07:002010-10-14T00:46:23.774-07:00Default Parameters<p>Subroutine calls can be simplified when one doesn't have to provide arguments for every parameter.</p><h4>Specifying defaults in the signature</h4><p>If you're lucky, your language might allow your signature to specify defaults:</p><pre class="code">function f(x = 3, y = 5, z = 10) {
...
}
f(); <span class="comment">// same as f(3, 5, 10);</span>
f(44); <span class="comment">// same as f(44, 5, 10);</span>
f(32, 1); <span class="comment">// same as f(32, 1, 10);</span>
f(17, 15, 22);
</pre><p>Other forms for specifying defaults:</p><pre class="code">(DEFUN f ((x 3) (y 5) (z 10)) ...)
def f(x := 3, y := 5, z := 10) ... end
</pre><h4>Supplying defaults in the subroutine body</h4><p>Some languages don't support a syntax in which defaults can be supplied in a signature, but they do let calls be made with fewer arguments than parameters. Parameters for missing arguments automatically receive a default value like <strong>nil</strong> or <strong>undefined</strong>. Whatever the value, these languages almost always treat it as falsy, allowing the use of an <tt>||=</tt> assignment, as in:</p><pre class="code">function f(x, y, z) {
x ||= 3;
y ||= 5;
z ||= 10;
...
}
</pre><p>The problem here is that other values which you might want to pass in might be treated as falsy by a language (such as 0 or the empty string), so this approach will not work in general. The idea can be implemented if the language has an operator to determine whether a value is "defined":</p><pre class="code">function f(x, y, z) {
x = defined x ? x : 3;
y = defined y ? y : 5;
z = defined z ? z : 10;
...
}
</pre><pre class="code">function f(x, y, z) {
x = 3 if not defined x;
y = 5 if not defined x;
z = 10 if not defined x;
...
}
</pre><pre class="code">function f(x, y, z) {
if (!defined(x)) {
x = 3;
}
if (!defined(y)) {
y = 5;
}
if (!defined(z)) {
z = 10;
}
...
}
</pre><p>These forms aren't as pretty because of the logic required in the body as opposed to the signature.</p><h4>Overloading</h4><p>Some languages allow neither defaults to be specified nor calls with too few arguments. In this case you can simulate defaults with overloading. In Java:</p><pre class="code">public void f(int x, int y, int z) {...}
public void f(int x, int y) {f(x, y, 10);}
public void f(int x) {f(x, 5, 10);}
public void f() {f(3, 5, 10);}
</pre>Ray Toalhttp://www.blogger.com/profile/11678817534637339600noreply@blogger.com0tag:blogger.com,1999:blog-3790736519734472386.post-11034930455285577722010-06-28T16:40:00.000-07:002010-10-14T00:47:01.792-07:00REST<p>REST, or representational state transfer, is a set of architectural principles used in the design of the world wide web, that have proven the key to the web's efficiency and scalability. What are these principles, and how can they be used in the design of programming languages?</p><h4>Architectural Constraints</h4><p>REST was described by <a href="http://roy.gbiv.com/">Roy Fielding</a> in his <a href="http://www.ics.uci.edu/~fielding/pubs/dissertation/top.htm">dissertation</a>. He called out the following set of architectural constraints:</p><ul><li>Client-Server — separates UI from data storage<br />
<li>Stateless Server — improves reliability and scalability<br />
<li>Client Cache — reduces some network traffic<br />
<li>Uniform Interface — decouples implementations from the services they provide<br />
<li>Layered System — means that each component only be concerned with those just below or just above it<br />
<li>Code-on-Demand — allows client functionality to be extended by downloading applets or scripts<br />
</ul><h4>Principles of REST</h4><p>A RESTful system is characterized by</p><ul><li>Addressable Resources<br />
<ul><li>Every resource has a (unique) identifier (e.g., URI)</li>
<li>Example: <tt>http://www.yourcompany.com/products/2443</tt></li>
<li><strong><big>NOT</big></strong> <tt>http://www.yourcompany.com/getProduct?id=2443</tt></li>
<li>Resources are separate from their representations</li>
</ul></li>
<li>Representation-Orientation<br />
<ul><li>Requests for resources return representations</li>
<li>The representation contains the state of the resource</li>
<li>The client can update or delete resources through their representations</li>
</ul></li>
<li>Uniform, Constrained Interface<br />
<ul><li>Small, fixed number of verbs</li>
<li>Example: In HTTP, GET, PUT, POST, DELETE, HEAD, OPTIONS, TRACE only!</li>
<li>An example of a non-constrained interface is an RPC-based protocol (e.g., SOAP, CORBA)</li>
</ul></li>
<li>Self-Descriptive Messages<br />
<ul><li>Messages include all the information necessary to be processed</li>
<li>Example: Internet Media Types</li>
</ul></li>
<li>Stateless Server<br />
<ul><li>With no shared context on the server, client requests are independent</li>
<li>Servers can be simpler, "more" multithreaded, easier to monitor, reliable, scalable, etc.</li>
</ul></li>
<li>Cacheability<br />
<ul><li>Responses indicate whether the client can cache the data or not</li>
<li>Obviously can cut down on network traffic if the client knows it doesn't have to ask for the same data again</li>
</ul></li>
<li>HATEOAS<br />
<ul><li>Hypermedia as the engine of application state</li>
<li>Representations should include links to related resources</li>
</ul></li>
</ul><p>Can we apply any of these principles to programming language design?</p><h4>System-wide unique identifiers</h4>The idea of identifiers having wide scope, whether unique across a system or unique across the entire world (such as URIs) exists in many forms: UUIDs, MAC addresses, public IP addresses, etc. In programming languages, Java package names are encouraged to be globally unique by using a (reversed) DNS name over which you have control. <h4>Resource/Representation Separation</h4><h4>A limited, fixed, set of verbs</h4><h4>Self-descriptive content</h4><h4>Statelessness</h4><h4>Cachability</h4><h4>Linkage of Representations</h4>Ray Toalhttp://www.blogger.com/profile/11678817534637339600noreply@blogger.com0tag:blogger.com,1999:blog-3790736519734472386.post-12261196553725888532010-06-22T21:36:00.000-07:002010-10-13T11:45:43.085-07:00Eval<p>A common feature in many scripting languages is an <tt>eval</tt> function. Pass it a string representing source code, and the function will compile it and then interpret it. This function is called <tt>eval</tt> if the source code string represents an expression that produces a value, and might be called <tt>exec</tt> if it does not.</p><p><tt>Eval</tt> is controversial and much-maligned because it is (1) so often abused, (2) so often used when much better alternatives exist, (3) is slow, and (4) can lead to disaster if misused. It is not rare to hear the claim "eval is evil." So the questions are: should a language allow <tt>eval</tt> and if so, how should this feature be designed?</p><h4>Problems with eval</h4><p>Things to watch out for:</p><ul><li>Eval is <strong>slow</strong> because a compiler has to be launched to lex and parse the code string, prior to evaluating it.</li>
<li>Eval is <strong>lame</strong> because programmers may sometimes use eval without thinking. While it would be rare for anyone to blatantly pass a fixed string to eval, such as<br />
<pre class="code">eval("x = 3;");
</pre><p>one can usually get away with creating an anonymous function.</p></li>
<li>Eval is a <strong>security hole</strong>. Since the only reason for using eval is to run code that is supplied at run time, it's possible that this code may come from an untrusted or malicious source. Allowing just anyone to run code on your own machine is crazy.</li>
</ul><h4>When is eval not evil?</h4><p>If you are going to use <tt>eval</tt>, the string must be completely <em>sanitized</em>. You need to check it for infinite loops, assignments, or calls with side-effects that might destroy the integrity of your application. For example, here is a JavaScript application that accepts an arithmetic expression from a user and evaluates it, first checking to make sure the input consists only of digits, parentheses, and the four basic arithmetic operators:<br />
<pre class="code">var s = prompt("Enter a numeric formula");
if (/[^\d()+*/-]/.test(s)) {
alert("I don't trust that input");
} else {
alert(eval(s));
}
</pre><h4>Eval and Language Design</h4><p>Either a language will allow evaluation of code strings or it will not. If it does, we can provide that functionality through a function or an operator. Because it is dangerous, it is definitely a candidate construct for being required to appear inside a "warning" construct, or similarly, disallowed from strict modes. Examples:</p><pre class="code">UNSAFE module Calcuator {
....
// use eval here
....
}
</pre><pre class="code">use module UNSAFE;
application Calcuator {
....
// use UNSAFE.eval here
....
}
</pre><h4>Eval as a function</h4><p>Generally, we see <tt>eval</tt> as a global function, or a member of a module entitled something like <tt>Kernel</tt>, <tt>System</tt>, or perhaps better, <tt>UNSAFE</tt>.</p><h4>Eval as an operator</h4><p>Another way to call attention to the use of eval is to simply make it a unary operator on strings. Perhaps:</p><pre class="code">var s = prompt("Enter an expression");
alert( `` s );
</pre><p>Alternatively, we might see the string to be evaled enclosed in angle or other brackets.</p><h4>Disallowing Eval</h4><p>Clearly we could imagine a language without an eval function or operator. C is one such language.</p><h4>Forced Sanitization</h4><p>An interesting new idea is for the eval function to take a second parameter which is (1) a block or anonymous function that performs sanitization, or (2) a regex which will be applied to the string so that the string will only be applied if the regex pattern matches. Example:</p><pre class="code">var s = prompt("Enter a numeric formula");
alert(eval(s, /[^\d()+*/-]/));
</pre><p>Throwing an exception on not matching would likely be the best course of action here.</p>Ray Toalhttp://www.blogger.com/profile/11678817534637339600noreply@blogger.com0tag:blogger.com,1999:blog-3790736519734472386.post-22331430041453933102010-06-06T02:42:00.000-07:002010-06-06T15:14:26.631-07:00Regular Expressions<p>A regular expression (or regex, regexp, r.e.) is a pattern that matches a string or a portion of a string. They are used for validation, search, and find/replace. You'll find them in Java, JavaScript, Ruby, Python, Perl, and dozens of other popular languages. All languages agree on a common notation core, e.g. square brackets for character classes, <tt>?</tt> for zero-or-one, <tt>*</tt> for zero or more, <tt>+</tt> for one-or-more, parentheses for grouping and capturing, and so on.</p><p>There's room for some notational innovations here, though. We'll cover a little of what's already in use and sketch some new ideas, too.</p><h4>Examples from the Common Notation</h4><p>Virtually all languages will interpret these the same way:</p><pre class="code">hello
gray|grey
gr(a|e)y
gr[ae]y
colou?r
go*gle
go+gle
g(oo)+gle
z{3}
z{3,6}
z{3,}
[Bb]rainf\*\*k
\d{5}(-\d{4})?
1\d{10}
[2-9]|[12]\d|3[0-6]
Hello\nworld
b..b
\d+(\.\d\d)?
^dog
dog$
^dog$
sh[^i]t
\d+(\.\d+([Ee][+-]?\d+)?)?
https?://(?:www\.)citysearch\.com/profile/[^/]+/(\d+)/?
</pre><h4>Metacharacters and Escapes</h4><p>Most characters stand for themsleves in a pattern. The ones that don't are called <em>metacharacters</em>. Most languages have 14 metacharacters (some more, some less). The usual ones are:</p><pre> | ? * + . ^ $
( ) [ ] { } \
</pre><p>If you want to use a metacharacter as a regular character, you have to escape it, e.g. <tt>\+</tt>, <tt>\[</tt>, <tt>\\</tt>, etc. The backslash also introduces dozens of other simplified expressions:<br />
<pre>...
... TODO ...
...
</pre><h4>Defining Patterns</h4><p>How do you write a regex, such as <tt>\d+(\.\d+)?</tt> in program text? How about in a string literal?</p><pre class="code">'\d+(\.\d+)?'
</pre><p>Maybe.... The thing is most languages use <tt>\</tt> to escape within a string literal. Perl and Ruby don't make the backslash special if the string literal has single quotes, but they do for double quotes, like most languages. In that case, you have to write:</p><pre class="code">"\\d+(\\.\\d+)?"
</pre><p>And if you need to write a regex that has to match a backslash character, the regex is <tt>\\</tt>, which looks like this in a string literal:</p><pre class="code">"\\\\"
</pre><p>But a regex is <em>not</em> a string: it has to be <em>compiled</em> into a little program that matches:</p><pre class="code">var r = re.compile("\\d+(\\.\\d+)?")
r = "\\d+(\\.\\d+)?".toregex
Pattern p = Pattern.compile("\\d+(\\.\\d+)?");
</pre><p>But since regexes are so important, and that backslash issue gets so annoying, we'd like a language to have a special syntax for describing regexes—something that automatically compiles once, and frees us from doubly escaping backslashes. Common notations include Ruby's <tt>%r</tt><br />
and slash delimiters:</p><pre class="code">%r{\d+(\.\d+)?}
/\d+(\.\d+)?/
</pre><p>The problem with the slash approach is that you have to escape any <tt>/</tt> characters you want in the regex itself. The <tt>%r</tt> approach has so such trouble; the language's parser can figure out which is the terminating <tt>}</tt> easily.</p><h4>Matchers</h4><p>To use a regex you <em>match</em> it against some text. This allows you to</p><ul><li>Validate that the text conforms to a desired pattern</li>
<li>Find (and extract) the parts of the text that do conform</li>
<li>Replace matching portions of the text with something else</li>
</ul><p>So these matches are stateful ... TODO ... Once you have a (compiled, immutable) pattern, you use it by creating a (stateful) matcher.<br />
<h4>Character Classes</h4><h4>Quantifiers</h4><h4>Groups and Capturing</h4><h4>Backreferences</h4><h4>Boundaries</h4><h4>Lookaround</h4><h4>Modifiers</h4>Ray Toalhttp://www.blogger.com/profile/11678817534637339600noreply@blogger.com0tag:blogger.com,1999:blog-3790736519734472386.post-72522474679814618302010-05-20T09:08:00.000-07:002010-12-01T21:29:13.619-08:00List Comprehensions<p>A list comprehension is a compact syntactic form for a list (or array) that is built from applying an operation to certain elements of an existing list (or array). For example:</p><pre class="code">[7*x-3 | x in 1..n | x mod 3 == 1]
</pre><p>means the list of all values made by computing 7 * x - 3, for each x in 1 through n where x mod 3 is 1. If n is 10, for example, the generated list is [4, 25, 46, 67]. The three parts of the comprehension are called the output (expression), the input (range) and the filter (predicate).</p><h4>Common uses of comprehensions</h4><pre class="code">[7*x-3 | x in 1..n | x mod 3 == 1] // operate on (map) and filter a range
[7*x-3 | x in a | x mod 3 == 1] // map and filter an existing list
[7*x-3 | x in 1..n | true] // no filtering, just map
[x | x in 1..100 | isPrime(x) ] // just a filter, no map
</pre><h4>Sketches</h4><pre class="code">{7*x-3 | x in 1..n | x mod 3 == 1}
[7*x-3 for x in a where x mod 3 == 1]
[7*x-3 for x in a if x mod 3 == 1] // Python
[7*x-3 for each (i in a) if (x % 3 == 1)] // JavaScript 1.7 (uncommon)
for (i <- (1 to 10) if x % 3 == 1) yield 7*x-3 // Scala
[7*x-3, x in 1..n, x mod 3 == 1]
{7*x-3 : x ∈ 1..n ∧ x mod 3 == 1} // Plain old Math!!
[7*x-3 : x ∈ 1..n, x mod 3 == 1]
[7*_-3 | a | _ mod 3 == 1] // implicit variable
</pre><h4>Complex input sets</h4>How do we handle the idea of two range variables for the input set? We can write two range expressions or range over a set of tuples:<br />
<br />
<pre class="code">[2*x+y | x in 1..n, y in 'a'..'z' | x mod 3 == 1 and isVowel(y)]
[(x, y*y) | (x, y) ∈ a × (1..10) | x < y]
[x + y for x in a for y in b] // Python
</pre><h4>Life without comprehensions</h4>If you don't have comprehensions, or a compact, symbolic list append operator, you end up writing this:<br />
<pre class="code">List<Integer> a = new List<Integer>();
for (int x = 1; x <= n; x++) {
if (x % 3 == 1) {
a.append(7 * x - 3);
}
}
</pre>Many programmers like that kind of coding. More verbose, but everyone can understand it.<br />
In languages that treat <tt>for</tt> and <tt>if</tt> as suffixes to simple statements (like Perl and Ruby), we get compact forms that can be nearly as terse as comprehensions:<br />
<pre class="code">a = [];
for x in 1..n
a << 7 * x - 3 if x mod 3 == 1
end
</pre><pre class="code">a = [];
a << 7 * x - 3 if x mod 3 == 1 for x in 1..n
</pre><pre class="code">a = [];
a << 7 * x - 3 for x in 1..n where x mod 3 == 1
</pre><p>And, of course, there are <tt>map</tt> and <tt>filter</tt> (or <tt>select</tt> as it is called in Ruby)...</p><pre class="code">(1..n).filter(x -> x mod 3 == 1).map(x -> 7 * x - 3)
(1..10).select{|x| x % 3 == 1}.map{|x| 7 * x - 3} // Ruby
(1 to n) filter { _%3==1 } map { 7*_-3 } // Scala
map((x -> 7 * x - 3), filter((x -> x mod 3 == 1), (1..n)))
</pre><p>Try the previous examples with your favorite form of <a href="http://language-design.blogspot.com/2010/03/lambdas.html">lambda expression</a>.</p><p>Here's an interesting variation, in Ruby that uses map but not select. Instead we add a <tt>if</tt> clause to the map, which generates extra <tt>nil</tt>s, which we squeeze out with <tt>compact</tt>. Not pretty, but it gets that nice little "if" in there. Fun.<br />
<pre class="code">(1..n).map{|x| 7 * x - 3 if x % 3 == 0}.compact
</pre>Ray Toalhttp://www.blogger.com/profile/11678817534637339600noreply@blogger.com3tag:blogger.com,1999:blog-3790736519734472386.post-88812764598665848562010-05-20T08:21:00.000-07:002010-10-14T00:44:29.103-07:00Information Hiding<p><em>Information hiding</em> occurs when a language allows some program entities to be invisible to parts of a program. When applied to data within an object or module, we usually use the more specific term <em>encapsulation</em>. Information hiding is a good thing since it can make it harder to write certain types of mailicious code, and make it harder to accidentally ruin parts of a system you didn't mean to touch.</p><p>Here are some different language forms for information hiding.</p><h4>Explicit information hiding</h4><p>A direct approach is to use a modifier — a keyword or symbol that defines an access level for an entity. Here are various possible forms:</p><pre class="code">private var x;
protected double x;
private protected var x;
package protected boolean x;
package local var x;
public array of int x;
x: public string;
-int x;
+int x;
-(int) x;
+(int) x;
</pre><p>These modifiers can be used to hide a class within a package, a member within a class, or just about anything within a module. Visibility of fields for example, can be restricted to</p><ul><li>The object in which it appears</li>
<li>All objects of the same class in which it appears</li>
<li>The package in which it appears</li>
<li>Its class and all its class's subclasses</li>
<li>Its class and subclasses <em>and</em> the same package</li>
</ul><p>...and it can have no restrictions at all (this generally uses the modifier <tt>public</tt>).</p><p>For languages that favor symbols over words, you could use something like</p><ul><li>- for most-restricted (private)</li>
<li># for semi-restricted or (protected or package-private or package-protected)</li>
<li>+ for least-restricted or (public)</li>
</ul><p>A good case can be made for <tt>+</tt> and <tt>-</tt> ... not sure about others.</p><h4>Implicit encapsulation</h4><p>The other approach is to implement information hiding simply by the position of entity declarations. This is normally done by nesting, for example, local variables are almost always automatically private to a function (or method or procedure) body:</p><pre class="code">function f(x) {
var y;
...
}
</pre><p>Parameters are usually private to a function too. We say "usually" because in languages that support named parameter association, calling code can see the parameter name, but it can only be used in the call, so it's plenty safe.</p><p>We can have variables that are local to a block, or an iteration statement:</p><pre class="code">for i in x..y
<span class="comment"># Some code using i</span>
end
</pre><pre class="code">x.upto(y) do |i|
<span class="comment"># Some code using i</span>
end
</pre><p>One interesting style is to have a module in which there is one section (usually called the <em>interface</em>) which holds the visible identifiers and another section (the <em>implementation</em>) holding private stuff. A sketch:</p><pre class="code">module Stacks
type Stack(int capacity);
vpid push(Object)
Object pop()
int size()
implementation
<span class="comment">Any variable defined here is private</span>
<span class="comment">to this entire implementation section</span>
. . .
<span class="comment">Here we expand on the interface items</span>
type Stack(int capacity) is
int top = 0;
Object[] data = Array(Object).new(capacity)
end
void push(Object x)
raise Overflow if top == capacity;
data[top++] = x
end
Object pop()
raise Underflow if top == 0;
data[top--]
end
int size()
top
end
end
</pre><h4>Hybrid Approaches</h4><p>We can also have modules in which <em>everything</em> is private (hidden) by default — and an explicit export is needed.</p><pre class="code">module m
export a; <span class="comment">// but not b</span>
int a, b;
. . .
end
</pre><p>In some cases modules are completely closed off and you need an explicit import also.</p><pre class="code">int x, y;
. . .
module m
import x; <span class="comment">// but not y</span>
export a; <span class="comment">// but not b</span>
int a, b;
. . .
end
</pre>Ray Toalhttp://www.blogger.com/profile/11678817534637339600noreply@blogger.com2tag:blogger.com,1999:blog-3790736519734472386.post-81034102156557334802010-05-02T13:26:00.000-07:002010-05-23T19:54:10.518-07:00Reserved Words<p>Many languages <em>reserve</em> words for their own use, meaning that you, the programmer, cannot use these words for your own purposes. Reserved words are related to <strong>keywords</strong> which are words used for certain purposes by the language, but which you can also use as you wish.</p><p>What is the motivation for a language to have reserved words? Are reserved words a good thing or not? Is there such thing as different degrees of being reserved?</p><h4>Why reserve words at all?</h4><p>A language might want to reserve words to limit confusion. The following are legal statements in PL/1, a language which chose not to reserve IF, THEN, ELSE, DO, and similar keywords:</p><pre class="code">IF THEN THEN THEN = ELSE; ELSE ELSE = THEN;
IF IF THEN ELSE = IF; ELSE THEN = ELSE;
DO WHILE (WHILE = DO); END = WHILE + DO; END;
</pre><p>Would anyone really write code like that? Even if someone felt the need to abuse keywords like that, simply reserving them doesn't completely prevent programmers from writing confusing code in other ways. There's no stopping this:</p><pre class="code">int subtract(int x, int y) {
return x + y;
}
</pre><p>Perhaps the real reason languages reserve words is to make things easier for the compiler, or to give the language authors an outlet in which to be funny. For example, the word <tt>goto</tt> is reserved in Java, even though it isn't used for anything. Of course, nothing prevents a Java programmer from using the word in package names, comments, and string literals. And, as Java is case-sensitive, there's always <tt>class Goto ... </tt>.</p><h4>Why <em>not</em> reserve words?</h4><p>There are a few reasons for not reserving words:</p><ul><li>Programmers should have the freedom to choose their own identifiers. It's the compiler's job to figure out when words are used in a particular context.</li>
<li>It's annoying to programmers. There are thousands of Java programs in the world using variable and parameter names called <tt>klass</tt>, <tt>clazz</tt>, or <tt>c</tt>, because <tt>class</tt> is verboten. (Awesome hackers might even use <tt>clаss</tt> — haha, the third letter there is a Cyrillic small letter a, not a Latin one!)</li>
<li>As a language evolves, the addition of new reserved words will break old code.</li>
</ul><h4>Keywords versus symbols</h4><p>Here are three alternative syntaxes for a certain statement:</p><pre class="code">if greater(x, y) and w then begin assign x plus 4 to z; print z; end;
if (x > y && w) {z = x + 4; print z}
(x > y && w ==> z = x + 4, !z)
</pre><p>The obvious things to notice are that too many keywords make code too verbose and readability goes down. Too few keywords gives you cryptic code. Most people understand + and * and /, but really, a <tt>!</tt> for "print"? In fact, most symbols aren't obvious at first. Different languages use = differently. And what about these things: <=>. =~. %. ===, ^, *., #, ??, and who knows what else? Anyone use <a href="http://en.wikipedia.org/wiki/APL_(programming_language)">APL</a>?</p><p>Symbols are sometimes preferred to keywords for another reason: if keywords are in a language you don't speak, they might be harder to learn than symbols. Programmers that know Spanish but not English may prefer 'y' to 'and' and 'mientras' to 'while'. This is one use for macros, I suppose. In C:</p><pre class="code">#define mientras while
</pre><h4>Symbols as keywords</h4><p>In ML, identifiers can be alphanumeric or symbolic. So <tt>+</tt>, <tt><=*=></tt> and <tt>====></tt> are all identifiers. You can define functions with these names. And you can define <tt>+</tt> to do subtraction. You can even do that in Ruby. Is this okay? The choices seem to be:</p><ul><li>Allow symbolic identifiers, but treat some as reserved words so that no programmer can override such basic operators as <tt>+</tt>.</li>
<li>Allow symbolic identifiers but place no restrictions on them. After all, the reason languages allow "operator overloading" is to allow addition on vectors and other structured objects.</li>
<li>Disallow symbolic identifiers but permit operator overloading (for the fixed set of predefined symbolic operators).</li>
<li>Prohibit both symbolic identifiers and operator overloading.</li>
</ul><h4>How reserved is reserved?</h4><p>When a language specification says that a word is reserved, it generally means one of two things:</p><ul><li>The word cannot, under any circumstances, be used as a name for any programmer-defined entity (e.g., a variable, constant, function, type, object property name, method, function parameter, statement label, etc.)</li>
<li>The word can be used to name some programmer-defined entities, but its position in the source code text is restricted to simplify language processing (specifically, tokenization).</li>
</ul><p>The former case may be stricter than necessary because <em>quoting</em>, for one thing, works pretty well for object properties (at least):<br />
<pre class="code">point["long"] = 118.3532;
</pre><p>You can't stop people from using reserved words in string literals, after all. Similar to quoting is escaping; for example, you would write something like :</p><pre class="code">var @while = 2;
</pre><p>The <tt>@</tt> is not part of the variable name, it's just an escape character that tells the compiler the variable is really named 'while' but we're going to use it as an identifier and not as the beginning of a while statement.</p><p>The second case is interesting. JavaScript, for example, reserves the word "long" and while it allows</p><pre class="code">point["long"] = 118.3532;
</pre><p>it does <em>not</em> allow</p><pre class="code">point.long = 118.3532;
</pre><p>The probable reason for this is that it wants its lexical analyzer to be independent of context: for it to simply look that the above text and tokenize it into</p><pre> IDENTIFIER ("point")
DOT
KEYWORD_LONG
ASSIGNMENT_OPERATOR
NUMBER ("118.3532")
</pre><p>But is that really important in practice anymore? Lexical analyzers can easily choose a different meaning for a token if, say, the previous token is a dot! That would allow <tt>point.long</tt>. Now what about <tt>var point = {lat: 40, long: -110}</tt>? It would seem we'd have to allow the lexical analyzer to have access to the parser's state, but really, just wait until the semantic analysis phase and allow certain known keyword tokens to be just as good as identifiers when it comes to object property names.</p>Ray Toalhttp://www.blogger.com/profile/11678817534637339600noreply@blogger.com3tag:blogger.com,1999:blog-3790736519734472386.post-55494405822273419802010-04-29T19:18:00.000-07:002010-10-14T00:43:40.533-07:00Object ConstructionThere are countless ways to build objects. Here are a few.<br />
<h4>Classic Constructors</h4>Many class-based object oriented languages have <em>constructors</em> that look suspiciously like methods. Unlike instance methods which operate on existing objects, though, these constructors bring new objects into existence, and generally use a special syntax (such as an operator called <tt>new</tt>) to distinguish their invocation from that of a method.<br />
<pre class="code">new Point("Pantages Theater", 34.1017, -118.3255,
"Hollywood", "Los Angeles", "CA", "us", "90028",
new LocalDate(1930, 6, 4), Status.ACTIVE)
</pre>If there are too many parameters, you can pass in a map (dictionary, associative array, hash), or your language might support named parameter association.<br />
<h4>Setters</h4>With setter methods, you first construct an object in which all of its fields are set to default values, then you customize the object by setting just those fields that differ from the defaults.<br />
<pre class="code">Point p = new Point();
p.setName("Pantages Theater");
p.setLatitude(34.1017);
p.setLongitude(-118.3255);
p.setNeighborhood("Hollywood");
p.setCity("Los Angeles");
p.setState("CA");
p.setCountry("us");
p.setPostalCode("90028");
p.setEstablished(new LocalDate(1930, 6, 4));
p.setStatus(Status.ACTIVE);
</pre>This approach solves the three problems above, but introduces a new problem: the object being constructed can't be immutable. How can you get around this?<br />
<ul><li>Some languages allow fields to be marked "final" so that after the first set, they can't be set again.<br />
<li>In some languages, you can scope the setter methods so that only the parts of the code that are supposed to the construct the objects can see the setters. Since objects should be able to be constructed anywhere, this all but requires the programmer to factor out construction code into its module, which could be unwieldy. But there are ways to do it nicely in general; one way is the builder, described next.<br />
</ul><h4>Builders</h4>Builder objects are common too. They have the three advantages of the setter approach (no need to supply non-default values, order doesn't matter, and reader knows what the arguments mean), while also allowing the newly constructed object to be immutable. <pre class="code">Point p = new PointBuilder()
.name("Pantages Theater");
.latitude(34.1017);
.longitude(-118.3255);
.neighborhood("Hollywood");
.city("Los Angeles");
.state("CA");
.country("us");
.postalCode("90028");
.established(new LocalDate(1930, 6, 4));
.status(Status.ACTIVE)
.build();
</pre><h4>Factory Methods</h4><p>A factory method is any regular method (or regular function) that creates and returns a new object. There's no special syntax here; all that is done is to call a constructor, builder, or other device inside the body of the factory method. In Java:</p><pre class="code">public class Card {
private Suit suit;
private Rank rank;
private Card(Suit s; Rank r) { // Constructor must be private
suit = s; // to force all clients to use
rank = r; // the factory method below
}
public static Card fromRankAndSuit(Rank r, Suit s) {
return new Card(s, r);
}
. . .
}
</pre><h4>Literals</h4><p>In languages without classes, or when classes are not too important (say, when there is duck typing), object literals create new objects. Here's a JavaScript example. Fields are string literals, but we get to omit the quotes if the field name is a simple word that is not a reserved word:</p><pre class="code">{name: "Pantages Theater", lat: 34.1017, "long": -118.3255,
neighborhood: "Hollywood", "state or province": "CA",
country: "us", "postal code": "90028", city: "Los Angeles",
established: new LocalDate("1930-06-04"), status: Status.ACTIVE)}
</pre><p>Object literals are also possible in languages with statically typed, closed classes. In Ada:</p><pre class="code">type Point is record
X: Integer;
Y: Integer;
end record;
. . .
var P: Point := Point'(6, 5);
var Q: Point := Point'(Y => 5, X => 6);
</pre><h4>Construction in Prototypal Languages</h4><p><a name="es5"></a> Modern (ECMAScript 5-based) JavaScript has nice features for creating objects with a given prototype. I came up with this pattern:</p><pre class="code">/*
* An immutable circle datatype. Synopsis:
*
* var c = Circle.create(5, {x: 1, y: 4});
* c.radius => 5
* c.center => {x: 1, y: 4}
* c.area() => 25π
* c.perimeter() => 10π
*/
var Circle = {
prototype: {
area: function () {return Math.PI * this.radius * this.radius;},
perimeter: function () {return 2 * Math.PI * this.radius;}
},
create: function (r, c) {
return Object.create(this.prototype, {
center: {value: c},
radius: {value: r}
});
}
};
</pre><p>The circles we create with <tt>Circle.create</tt> are immutable because we use JavaScript's default property descriptors for <tt>center</tt> and <tt>radius</tt>: they're not writable, not enumberable, we can't delete them, nor can we change these settings. The same isn't true for the <tt>Circle</tt> properties (<tt>prototype</tt> and <tt>create</tt>), nor for the prototype's own properties...is it worth using property descriptors on these too?</p>Ray Toalhttp://www.blogger.com/profile/11678817534637339600noreply@blogger.com0tag:blogger.com,1999:blog-3790736519734472386.post-51900504354315227742010-04-17T11:37:00.000-07:002010-10-13T09:50:20.996-07:00Global Variables<p>The concept of a global variable seems pretty simple, especially to beginning programmers writing a single-file program in a block-structured language. But for languages with modules, or languages that support multithreading or multiprocessing, or with scripting languages embedded in other systems that expose "host" objects via script variables, things are not so clear.</p><p>Most programmers would warn that globals should be used sparingly, for various reasons:</p><ul><li>Because they can be written to from anywhere, it's harder to tell what might be going on just by looking at local portions of code. Code is harder to debug.<br />
<li>Globals make multithreaded programming harder to reason about, and make race conditions more likely to pop up.<br />
<li>Parameters are almost always a more readable and understandable way to "share" data among routines.<br />
</ul><p>How do we live with, or manage, global variables? What constructs can a language have to mitigate these problems?</p><h4>Languages that Require Global Variables</h4><p>There's an interesting distinction between languages that <em>allow</em> global variables and those that <em>require</em> them. If a language allows pieces of a program to be brought together from different physical components (such as files), but does not provide a programming language construct like a module or package, then global variables are basically required to share information. Even if you adopt a message-passing type of communication between functions in different files, without a nice module construct, those functions are referenced with global variables.</p><h4>Modules</h4><p>Many languages have modules for the express purpose of limiting the number of global variables and hence minimizing the number of name collisions. The simplest example is that of a math module, found in most modern languages. Instead of global variables for E, PI, sin, cos, tan, etc., these values are wrapped in a module construct (usually called Math), which is the only top-level name exposed to the rest of the code. Examples:</p><pre class="code"># Ruby-like
module Math
def sin(x) ... end
def cos(x) ... end
PI = 3.141592653589793
E = 2.718281828459045
end
</pre><pre class="code">// JavaScript-like
var Math = {
sin: function (x) {...},
cos: function (x) {...},
PI: 3.141592653589793,
E: 2.718281828459045
};
</pre><pre class="code"># Java-like
public class Math {
private Math() {}
public static double sin(double x) {...}
public static double cos(double x) {...}
public static double PI = 3.141592653589793;
public static double E = 2.718281828459045;
}
</pre><p>Modules create a namespace for these values which would otherwise be global. If there is a chance of having too many modules (that is, a possibility of having names of modules collide), modules can be grouped into some higher level named collection (Java calls this a <em>package</em>). Whether or not the additional construct is used, one can always use a hierarchical naming convention for global entities, starting, for instance, with a (reversed) DNS name that the programmer "owns."</p><h4>Multithreading</h4><p>The dangers of unprotected global variables in multithreading code (e.g., race conditions, lost updates) are so well known we don't need to explain them here. If you are stuck with a global, shared, variable for communication between threads, your language might provide some mechanism for atomic updating. For example:
<pre class="code">atomic var balance = 0;
</pre><p>or</p><pre class="code">AtomicInteger balance = new AtomicInteger(0);
</pre><h4>Internet Explorer Events</h4><p>In most event-driven systems, an event object is passed to event handlers. If you are using JavaScript, and are writing event handlers to function on any browser except Internet Explorer, this will be the case. You write code like:</p><pre class="code">document.body.onclick = function (e) {
alert("Clicked at (" + e.clientX + "," + e.clientY + ")");
}
</pre><p>With Internet Explorer, the event object is not passed to your handler; instead, the most recently fired event is accessed by a global variable (no kidding!) called <tt>event</tt>. IE gets away with this because client-side JavaScript is single-threaded: all events are queued and are handled sequentially. To make your code work on multiple browsers you can use the following idiom:</p><pre class="code">document.body.onclick = function (e) {
if (!e) e = event;
alert("Clicked at (" + e.clientX + "," + e.clientY + ")");
}
</pre><h4>Sigils</h4><p>Because local variables are usually better than global variables, one interesting design choice is to <strong>make locals the default and make globals ugly</strong>. Ruby does exactly this: global variables begin with a <tt>$</tt> character. That's a <a href="http://en.wikipedia.org/wiki/Sigil_%28computer_programming%29">sigil</a> — a character used within a variable name to indicate its type, category, or (in this case) scope.</p>Example: <pre class="code">$x = 10
def f
x = 3
puts x
puts $x
end
def g
puts $x
end
def h
puts x
end
f <span class="comment"># writes 3 then 10</span>
g <span class="comment"># writes 10</span>
puts $x <span class="comment"># writes 10</span>
h <span class="comment"># raises a NameError</span>
</pre><p>Another example: <pre class="code">x = 5
def f
puts x
end
f <span class="comment"># raises a NameError</span>
</pre><h4>Hiding in Closures and Anonymous Functions</h4>TODORay Toalhttp://www.blogger.com/profile/11678817534637339600noreply@blogger.com0tag:blogger.com,1999:blog-3790736519734472386.post-39261352322740376512010-03-13T09:45:00.000-08:002010-05-24T08:55:55.266-07:00LambdasLambda-expressions, or anonymous functions, are common in many languages, but are either missing or approximated in others.<br />
<h4>Simple Expressions</h4><p>There are many ways to write lambda expressions when the body is a single expression:</p><pre class="code">λ x . x + 1
{x -> x + 1} // <span class="comment">Curly braces are common</span>
[x -> x + 1] // <span class="comment">Not so common, looks like an array</span>
(x -> x + 1) // <span class="comment">Let the -> operator imply function type</span>
fn x => x + 1 // <span class="comment">ML</span>
lambda (x) x + 1
(lambda (x) (+ x 1))
sub {$0 + 1} // <span class="comment">Perl</span>
{$0 + 1}
{_ + 1} // <span class="comment">Scala</span>
λ x y . 2 * x + y
{x y -> 2 * x + y}
fn x y => 2 * x + y
fn (x,y) => 2 * x + y
lambda (x, y) 2 * x + y
sub {2 * $0 + $1}
{2 * $0 + $1}
</pre><p>Some of these require a bit more lookahead to parse than others.</p><h4>Complex Function Bodies</h4><p>If your language has a sequencing expression, like C's comma expression, where <tt>e1, e2, e3</tt> means to evaluate each expression in order and return the value of the last expression encountered, then you can use the sequencing expression in any of the forms above. Operator precedence becomes the only real issue, but parentheses can make it clear exactly what the function is.</p><pre class="code">{ p, i -> k = p * 0.45359237, m = 0.0254, k / (m * m) }
</pre><p>Let-expressions work well here, too<br />
<pre class="code">{ p, i -> let k = p * 0.45359237 and m = 0.0254 in k / (m * m) end }
</pre><p>Lambdas can be arbitrarily complex, with return statements. Some languages allow you to omit the return statement if the expression to return is the last one in the sequence.</p><pre class="code">TODO
</pre><h4>Recursion</h4><p>To arrange for an anonymous function to be called recursively, it needs to have a name or the language should reserve a special keyword to refer to the currently executing function (watch for scoping rules here).</p><pre class="code">{[f] n -> n <= 1 ? 1 : f(n - 1) * n}
</pre>
<pre class="code">{n -> n <= 1 ? 1 : CALLEE(n - 1) * n}
</pre>
<p>Allowing recursion in lambda expressions is far better than requiring the applicative Y combinator.</p><h4>Closures</h4><p>Lambdas often refer to variables in outer scopes. Within the lambda expression, these outside variables are called free variables.</p><pre class="code">function incrementer(int start, int delta) {
var current = start - delta;
return function () {return current += delta;}
}
</pre><h4>Calling Lambda Expressions</h4><p>Calling, or applying a function, is normally done with juxtaposition, though again precedence rules will often come into play.
<pre class="code">(x -> x + 1)7
(x -> x + 1)(7)
[x -> x + 1](7 * y - 2)
[x -> x + 1] 7 * y - 2 <span class="comment">// application or multiplication first?</span>
<s>fn (x, y) => 2 * x + y (7, y)</s> <span class="comment">// too confusing</span>
(fn (x, y) => 2 * x + y)(7, y)
</pre><h4>Currying</h4>Sketches:
<pre class="code">[x -> [y -> 2 * x + y]]
[x -> y -> 2 * x + y]
[x y -> 2 * x + y]
fn x => fn y => 2 * x + y
fn x y => 2 * x + y
</pre><h4>Fake Lambdas</h4>TODO
<h4>Delegates</h4>TODORay Toalhttp://www.blogger.com/profile/11678817534637339600noreply@blogger.com0tag:blogger.com,1999:blog-3790736519734472386.post-39895437535937782572010-03-03T18:43:00.000-08:002010-10-13T09:21:12.552-07:00Prototypes<p>There are two primary ways that languages allow you to create a set of objects with the same (or similar) structure and behavior. In the <strong>classical</strong> approach, you define <em>classes</em> and then create objects from <a href="http://language-design.blogspot.com/2010/04/object-construction.html">constructors</a> associated with the class; in this case, objects are <em>instances</em> of the class. In the <strong>prototypal</strong> approach, you create a <em>prototype</em> object, and then instantiate new objects based on the prototype. Generally, a language is either classical (e.g., Java, Ruby, C#), or prototypal (e.g, JavaScript, Self, Io, Lua), though some languages let you construct objects both ways (e.g., ActionScript, Falcon).</p><p>What might prototype-based instantiations look like?</p><h4>Simple Differential Inheritance</h4><p>Perhaps the purest approach to a prototypal mechanism is one in which every object has an implicit link to another object, called its <em>prototype</em>. Whenever we look up a property in an object and it is not present, we go look in its prototype. If the property isn't present in the prototype, we search the prototype's prototype, and so on. (Writing a property is different, however; if the property doesn't exist in the object it is created right there.) Such a language might have an operator called <tt>beget</tt> which might be used like this:<br />
<pre class="code">var protoCircle = {x: 0, y: 0, radius: 1, color: "black"};
var c1 = beget protoCircle;
c1.x = 4;
c1.color = "green";
</pre><p align="center"><img src="http://cs.lmu.edu/~ray/images/proto1.png" /></p><p>Here we have <tt>c1.x == 4</tt>, <tt>c1.y == 0</tt>, <tt>c1.radius == 1</tt>, and <tt>c1.color == "green"</tt>.</p><p>We can create more circles:</p><pre class="code">var c2 = beget protoCircle;
c2.x = 4;
c2.radius = 15;
c3 = beget protoCircle;
</pre>which gives us this:<br />
<p align="center"><img src="http://cs.lmu.edu/~ray/images/proto2.png" /></p><p>If we change any of the fields in the prototype, this change is picked up <em>at run time</em> by all the derived objects that haven't overridden those fields. Ditto for adding new fields to the prototype. To some people, this is convenient and flexible; to others it is ugly and error-prone. It's all those things, of course.</p><p>The prototype is also a great place to store function properties:<br />
<pre class="code">protoCircle.area = function () {return π * (this.radius ** 2);}
</pre><p>Here we've assumed we have a "<tt>this</tt>" operator which refers to the object through which the function was called (not necessarily the one along the prototype chain which ultimately contained the function).</p><h4>Direct Cloning</h4><p>Another approach is to define a prototype with default field values then <em>clone</em> it to make new objects. Cloning is different from simple differential inheritance because with cloning the new object automatically gets copy of <em>all</em> of the fields in the prototype, even for those that you wanted to keep the default values in the prototype. A clone operator might look like this:</p><pre class="code">var protoCircle = {x: 0, y: 0, radius: 1, color: "black"};
var c1 = clone protoCircle;
c1.x = 4;
c1.radius = 8;
c1.color = "green";</pre><p>Cloning means <tt>c1</tt> is going to get all four fields:</p><p align="center"><img src="http://cs.lmu.edu/~ray/images/clone1.png" /></p><p><h4>Cloning via user-defined functions</h4><p>If a language does not have a built-in clone operator, or you just don't like the look of creating an object and then doing multiple assignments to override defaults, you can do the cloning and field assignments in a function. Default fields can be nicely managed with default parameters, in languages that have them:</p><pre class="code">function newCircle(x = 0, y = 0, r = 1, c = "black") {
return {x: x, y: y, radius: r, color: c};
}
var c1 = newCircle(x => 4, r => 8, color => "green");
var c2 = newCircle(color => "yellow", r => 5);
var c3 = newCircle(_, 5, _, "blue");
</pre><p>or with a short circuit disjunction if you have that but no default parameters:</p><pre class="code">function newCircle(x, y, r, c) {
return {x: x || 0, y: y || 0, radius: r || 1, color: c || "black"};
}
var c1 = newCircle(x => 4, r => 8, color => "green");
</pre><p>With cloning, we have the problem of where to put the objects' methods. How about <a href="http://language-design.blogspot.com/2010/04/global-variables.html">global variables</a>?</p><pre class="code">function circleArea(c) {return π * c.radius ** 2;}
function circlePerimeter(c) {return 2 * π * c.radius;}
</pre><p>That's awful! It pollutes the global namespace. What about putting them inside the objects?</p><pre class="code">function newCircle(x = 0, y = 0, r = 1, c = "black") {
return {
x: x, y: y, radius: r, color: c,
area: function () {return π * this.radius ** 2;}
perimeter: function () {return 2 * π * this.radius;}
};
}
</pre><p>That's no good either — every time a circle object is created, two additional function objects are created for its area and perimeter. If we create thousands of circles we end up with thousands of identical area functions and thousands of perimeter functions sucking up memory. We should make sure we have only one copy of each (without defining global variables to access them). We want that nice little implicit link to the prototype back.<br />
<h4>JavaScript Constructors</h4><p>In JavaScript, prefixing a function call with <strong>new</strong> calls the function in a context in which the <tt>this</tt> expression refers to a newly constructed object, and, unless a return statement explicitly returns some other object, returns this newly constructed object.</p><pre class="code">function Circle(x, y, r, c) {
this.x = x || 0;
this.y = y || 0;
this.radius = r || 1;
this.color = c || "black";
}
</pre><p>It's also the case that JavaScript is prototypal, and weirdly so: every function is itself an object, with a property called <tt>prototype</tt>, whose value is the object that will be used as the prototype of every object created with that function. As usual, the prototype is the place where you drop in shared properties, like functions. Here's an example:</p><pre class="code">function Circle(x, y, r, c) {
this.x = x || 0;
this.y = y || 0;
this.radius = r || 1;
this.color = c || "black";
}
Circle.prototype.area = function () {return Math.PI * this.radius * this.radius;}
Circle.prototype.perimiter = function () {return Math.PI * 2 * this.radius;}
var c1 = new Circle(3, 8, 4, "green");
var c2 - new Circle(1, 7);
var c3 = new Circle();
</pre><p align="center"><img src="http://cs.lmu.edu/~ray/images/js-circles.png" /></p><h4>Better Prototypal Patterns for JavaScript</h4><p>It's pretty clear that JavaScript's <tt>new</tt> operator and the associated semantics for construction were designed to make the langusge look more classical. Douglas Crockford characterizes this beautifully by saying that "JavaScript is conflicted about its prototypal nature." He shows how to make JavaScript implement simple differential inheritance:<br />
<pre class="code">if (typeof Object.beget !== 'function') {
Object.beget = function(o) {
var F = function () {};
F.prototype = o;
return new F();
};
}
</pre><p>Our running circle example can now be written like this in JavaScript:</p><pre class="code">var protoCircle = {x: 0, y: 0, radius: 1, color: "black"};
var c1 = Object.beget(protoCircle);
c1.x = 4;
c1.color = "green";
</pre><p>One of the nice things about using <tt>beget</tt> is we can hide away JavaScript's crazy <tt>new</tt> operator and never have to use it again. This operator is fairly confusing and error-prone. For example, when you call functions that are supposed to be called with <tt>new</tt> but aren't, the code ends up creating or overwriting global variables. <tt>Object.beget</tt> is a nicer approach.</p><h4>ECMAScript 5 Improvements to JavaScript</h4><p>Current JavaScript versions (as of mid-2010) are based on the 3rd Edition of ECMAScript, known as ECMAScript 3. The newly standardized ECMAScript 5 adds many properties to the <tt>Object</tt> object, including <tt>create</tt>, which is like the <tt>beget</tt> example above. In new JavaScript:</p><pre class="code">var protoCircle = {x: 0, y: 0, radius: 1, color: "black"};
var c1 = Object.create(protoCircle, {
x: {value: 4},
color: {value: "green"}
});
</pre><p>Much better.</p>Ray Toalhttp://www.blogger.com/profile/11678817534637339600noreply@blogger.com0tag:blogger.com,1999:blog-3790736519734472386.post-32525069392784880842010-02-27T16:52:00.000-08:002012-09-19T19:08:34.125-07:00Object-Specific Methods<p>Sometimes a programmer likes to attach a method to a single object, instead of attaching it to all objects of a class. How can this be done?</p><p>We'll make three dogs, Sparky, Spot, and Spike. All dogs bark, but only Spike can bite.</p>
<h4>Classless Languages</h4>
<p>In JavaScript, just add function properties directly to the object:</p>
<pre class="code">
var Dog = function (name) {this.name = name;};
Dog.prototype.bark = function () {return "woof";};
var d1 = new Dog("Sparky");
var d2 = new Dog("Spot");
var d3 = new Dog("Spike");
d3.bite = function () {return "GOTCHA";};
</pre>
<p>It is hard to imagine improving on that.</p>
<h4>Singleton Classes</h4>
<p>In Ruby, you add the method to the object's <b>singleton class</b>:</p>
<pre class="code">
class Dog
def bark(); "woof"; end
end
d1 = Dog.new("Sparky");
d2 = Dog.new("Spot");
d3 = Dog.new("Spike");
class <<d3
def bite(); "GOTCHA"; end
end
</pre>
<p>The use of "<<" isn't as readable as it could be, at least for English speakers, so might there be other ways to do this for classes languages?</p>
<pre class="code">
Object.add_method(d3, "bite", {"GOTCHA"});
d3.add_method("bite", {"GOTCHA"});
def d3.bite(); "GOTCHA"; end;
</pre>
<p>Regardless of the syntax, with a class-based language, each object will have an internal reference not only to its class, but to its bundle of object-specific methods.</p>
<h4>Subclasses</h4>
<p>In a language where classes are closed, you "extend" classes by making new subclasses. This seems a little uglier in our contrived example. In Java:</p>
<pre class="code">class Dog {
private String name;
public Dog(String name) {this.name = name;}
public String bark() {return "woof";}
}
class BitingDog extends Dog {
public BitingDog(String name) {super(name);}
public String bite() {return "GOTCHA";}
}
Dog d1 = new Dog("Sparky");
Dog d2 = new Dog("Spot");
BitingDog d3 = new BitingDog("Spike");
</pre>
<p>This may be clunky, but it does make a lot more static analysis possible.</p>
Ray Toalhttp://www.blogger.com/profile/11678817534637339600noreply@blogger.com0tag:blogger.com,1999:blog-3790736519734472386.post-85342041041557440232010-02-11T22:02:00.001-08:002012-01-25T23:14:10.776-08:00Bracketing<p>At some point in the history of programming languages, the idea of code as lists of statements with jumps or gotos gave way to the idea of nested (hierarchical) declarations and control structures, as in:</p>
<p align="center"><img src="http://cs.lmu.edu/~ray/images/ast.png" alt="ast" /></p>
<p>So programs are really trees, but we still need a string representation for storage and transmission. How can we capture trees in text? There are two main approaches. The first is to just give a direct representation of the syntax tree. The other is to be creative with syntax, making something that a human would like to read.</p><p>Here are several sketches of representations of the above tree.</p><h4>S-Expressions</h4><p>The usual direct representation of trees in strings is that of "S-expressions" — parenthesized forms such as <tt>(A B C D)</tt> where A is the root and B, C, and D are A's (ordered) children. B, C, and D can be primitive values or trees.</p>
<pre class="code">
(function f (params x y)
(block
(var b 3)
(if (< x (* a (+ b 7)))
(while found
(block
(for i (range 2 5) (print i))
(if (== a 2) (break))))
(= p (cos 5)))
(return (/ a 5))))
</pre>
<p>Languages like <a href="http://en.wikipedia.org/wiki/Lisp_%28programming_language%29">LISP</a>, <a href="http://en.wikipedia.org/wiki/Scheme_%28programming_language%29">Scheme</a>, and <a href="http://en.wikipedia.org/wiki/Clojure">Clojure</a> use S-expressions (though with a few little extensions here and there).</p><p>In the example above, some of the objects aren't actually evaluated, so the distinction between symbols and values sometimes comes up. Occasionally this is made explicit in the syntax, for example:</p>
<pre class="code">
(function 'f (params 'x 'y)
(block
(var 'b 3)
(if (< x (* a (+ b 7)))
(while found
(block
(for 'i (range 2 5) (print i))
(if (== a 2) (break))))
(= p (cos 5)))
(return (/ a 5))))
</pre>
<p>Instead of the quote character for symbols, the colon or ampersand (or other character) could be used.</p><h4>XML Style</h4><p>The tree can be given directly in XML, too. This is probably too verbose, not only because we have some markup characters, but because <em>we have to invent elements for leaf nodes</em>.</p>
<pre class="code">
<function name="f" params="x,y">
<block>
<var id="b"><intlit value="3"/></var>
<if>
<less>
<ref id="x"/>
<times>
<ref id="a"/>
<plus><ref id="b"/><intlit value="7"></plus>
</times>
</less>
<while>
<ref id="found"/>
<block>
<for var="i">
<range><intlit value="2"/><intlit value="5"/></range>
<print><ref id="i"/></print>
</for>
<if>
<eq><ref id="a"/><intlit value="2"/></eq>
<break/>
</if>
</block>
</while>
<assign>
<ref id="p"/>
<cos><intlit value="5"/></cos>
</assign>
</if>
<return>5</return>
</block>
</function>
</pre>
<p>Reader exercise: Could this have been done without the <strong>ref</strong> and <strong>intlit</strong> elements?</p><h4>JSON Style</h4><p>Another form of direct tree representation is an encoding in JSON. This gets pretty verbose as well because in JSON the properties are unordered, so naming them seems to be the most reasonable approach. And again, we have to name <em>everything</em>:</p>
<pre class="code">
{type: "function", name: "f", params: ["x", "y"], body: {
type: "block", statements: [
{type: "var", id: "b", init: 3},
{type: "if",
condition: {type: "<", left: "x", right: {
type: "*",
left: "a",
right: {type: "+", left: "b", right: 7}
}},
truePart: {type: "while",
condition: "found", body: {type: "block", statements: [
{type: "for", var: "i",
range: {type: "range", low: 2, high: 5},
body: {type: "print", args: ["i"]}
},
{type: "if",
condition: {type: "==", left: "a", right: 2},
truePart: {type: "break"}
}
]}
},
falsePart: {type: "assign",
target: "p",
source: {type: "cos", args: [5]}
}
},
{type: "return", value: 5}
]
}}
</pre>
<h4>Indentation</h4>
<p>Okay, let's move away from direct representations and use some real syntax. By real syntax we mean cleverly arranging symbols in a pleasing and readable way but still having it be possible to unambiguously recapture the tree structure.</p>
<pre class="code">
function f (x, y)
var b = 3
if x < a * (b + 7)
while found
for i in 2..5
print i
if a == 2
break
else
p = cos(5)
return a / 5
</pre>
<h4>Curly Braces</h4>
<p>The use of curly braces to define subtrees is, for some reason, very popular.</p>
<pre class="code">function f (x) {
var b = 3;
if (x < a * (b + 7)) {
while (found) {
for (i in 2..5) {
print i;
}
if (a == 2) {
break;
}
}
} else {
p = cos(5);
}
return a / 5;
}
</pre>
<h4>Terminating Keywords</h4>
<p>Rather than marking the beginning and end of a block, as is done with the curly brace style and its equivalents, we can get away with <i>just</i> marking the end. This can be done simply with the word <strong>end</strong>:</p>
<pre class="code">
function f (x, y)
var b = 3
if x < a * (b + 7)
while found
for i in 2..5
print i
end
if a == 2
break
end
end
else
p = cos(5)
end
return a / 5
end
</pre>
<p>or by spelling the initial word backwards (yes, many languages do this):</p>
<pre class="code">
function f (x, y)
var b = 3
if x < a * (b + 7)
while found
for i in 2..5
print i
rof
if a == 2
break
fi
elihw
else
p = cos(5)
fi
return a / 5
noitcnuf
</pre>
<h4>Postfix Style</h4>
<p>Here's a sketch using the postfix style. Here it's definitely important, I think, to distinguish symbols from variables that need to be evaluated. I used square brackets to bundle code, as should be obvious:</p>
<pre class="code">
[
'x' param 'y' param
3 'b' var
x a b 7 + * <
[
found
[
'i' 2 5 [i print] for
2 a == [break] [] if
]
while]
[5 cos p assign]
if
a 5 / return
] 'f' function
</pre>
<h4>Functional Style, with 'let' and Infix Operators</h4>
<p>Here's a style that borrows from <a href="http://en.wikipedia.org/wiki/Standard_ML">Standard ML</a>. Block scope is introduced with the <strong>let</strong>...<strong>in</strong>...<strong>end</strong> expression, and expression sequences use semicolons as separators. Expression sequences are allowed in the body of a let, or wherever a regular expression can go, provided the sequence is enclosed in parentheses.</p>
<pre class="code">
function f (x, y) =
let
b = 3
in
if x < a * (b + 7) then
while found do (
for i in 2..5
print i
;
if a == 2 then
break
else
nil
)
else
p = cos(5)
;
a / 5
end
</pre>Ray Toalhttp://www.blogger.com/profile/11678817534637339600noreply@blogger.com0tag:blogger.com,1999:blog-3790736519734472386.post-992206820066635682010-02-08T17:50:00.001-08:002016-10-01T20:11:08.512-07:00Else ifsYou may have heard of the so-called <em>dangling else</em> problem. The text<br />
<pre class="code">if e1 then if e2 then s1 else s2</pre>
can reasonably mean either of two possible things:<br />
<div style="text-align: center;">
<img alt="danglingelse.png" src="http://cs.lmu.edu/%7Eray/images/danglingelse.png" /></div>
When mixed with other statements without any obvious end, things can get crazy:<br />
<pre class="code">if e1 then while e2 do if e3 then s1 else s2</pre>
The dangling-else problem arises in languages with ambiguous grammars, such as the Pascal-like:<br />
<pre class="grammar">Stmt = Assignment | IfStmt | WhileStmt | Block | ...
IfStmt = "if" Exp "then" Stmt ("else" Stmt)?
WhileStmt = "while" Exp "do" Stmt
Block = "begin" Stmt* "end"
</pre>
or the C-like:<br />
<pre class="grammar">Stmt = Assignment | IfStmt | WhileStmt | Block | ...
IfStmt = "if" "(" Exp ")" Stmt ("else" Stmt)?
WhileStmt = "while" "(" Exp ")" Stmt
Block = "{" Stmt* "}"
</pre>
There are many ways to deal with this problem, but most lead to rather unsatisfactory ways to handle the general multi-way conditional. Is there a design that avoids dangling else problems <em>and</em> forces programmers to write nice multiway conditional statements?<br />
<h4>
Resolve with a semantic rule</h4>
In Pascal and C, the syntax is ambiguous, but a <em>semantic rule</em> says that each <span style="font-family: "courier new" , "courier" , monospace;">else</span> must match the closest <span style="font-family: "courier new" , "courier" , monospace;">if</span>, <strong>regardless of indentation.</strong> This can be terribly confusing to beginners; the code below on the left doesn't mean what novices might think:<br />
<table border="0" style="width: 100%px;"><tbody>
<tr><td width="50%"><pre class="code"><span class="comment">(* Without blocks, 'else'
tied to second 'if' *)</span>
if cond1 then
if cond2 then
stmt1
else <span class="comment">(* Sorry! Goes w/ cond2! *)</span>
stmt2
</pre>
</td><td><pre class="code"><span class="comment">(* MUST use block to tie
'else' to first 'if' *)</span>
if cond1 then begin
if cond2 then
stmt1
end else <span class="comment">(* Now it goes w/ cond1. *)</span>
stmt2
</pre>
</td></tr>
</tbody></table>
One objection to this approach is that purists will probably hate how this solution relies on a semantic rule to handle something so inherently structural. A second objection is that multiway conditionals don’t really exist—the syntax can only describe deeply nested individual conditionals. So programmers end up playing games with their indentation to simulate multiway branching:<br />
<table border="0" style="width: 100%px;"><tbody>
<tr><td width="50%"><pre class="code">if cond1 then
stmt1
else if cond2 then
stmt2
else if cond3 then
stmt3
else if cond4 then
stmt4
else
stmt5
</pre>
</td><td><pre class="code">if cond1 then begin
stmts1
end else if cond2 then begin
stmts2
end else if cond3 then begin
stmts3
end else if cond4 then begin
stmts4
end else begin
stmts5
end</pre>
</td></tr>
</tbody></table>
This is ugly—but it could be a lot worse: programmers often mix up their conditional arms by sometimes using blocks and sometimes not, and get the ugly code they deserve. The worst part of the “semantic rule solution” dangling else syntax is that there seem to be hundreds of different ways to format multiway if statements. Take a look at some old Pascal textbooks if you want to be grossed out.<br />
<h4>
Complicate the grammar to remove the ambiguity</h4>
It is possible, with a lot of work, to force all <span style="font-family: "courier new" , "courier" , monospace;">else</span>s to be paired with the nearest <span style="font-family: "courier new" , "courier" , monospace;">then</span>. Call an <i>open statement</i> one with a sub-statement that dangles at the end, and a <i>closed statement</i> one that is not open. Then:<br />
<pre class="grammar">Stmt = OpenStmt | ClosedStmt
OpenStmt = IfThenStmt | IfThenElseStmt | WhileStmt | ForStmt
ClosedStmt = Assignment | Call | BreakStmt | ReturnStmt
NoShortIfStmt = ClosedStmt | NoShortIfIfThenElse
| NoShortIfWhile | NoShortIfFor
IfThenStmt = "if" Exp "then" Stmt
IfThenElseStmt = "if" Exp "then" NoShortIfStmt "else" Stmt
NoShortIfIfThenElse =
"if" EXP "then" NoShortIfStmt "else" NoShortIfStmt
NoShortIfWhile = "while" Exp "do" NoShortIfStmt
NoShortIfFor = "for" Var "in" Range "do" NoShortIfStmt
etc.</pre>
This grammar is no longer ambiguous, and every <span style="font-family: "courier new" , "courier" , monospace;">else</span> is automatically connected to the nearest <span style="font-family: "courier new" , "courier" , monospace;">if</span>. But it does nothing to prevent the confusion arising from misindented code we saw above.<br />
<h4>
Require the <span style="font-family: "courier new" , "courier" , monospace;">else</span> part</h4>
If there always <em>has</em> to be an else part, there is never any ambiguity, but you’ll have funny looking code when your <strong>else</strong> part is missing: you'll need some empty statement or a special <span style="font-family: Courier New, Courier, monospace;">null</span> keyword.<br />
<table border="0" style="width: 100%px;"><tbody>
<tr><td width="50%"><pre class="code">if cond1 then
if cond2 then
stmt1
else
null
else
stmt2</pre>
</td><td><pre class="code">if cond1 then
if cond2 then
stmt1
else
stmt2
else
null</pre>
</td></tr>
</tbody></table>
Not pretty. It works, though, even if the <span style="font-family: Courier New, Courier, monospace;">while</span> statement (and other compound statements) aren’t fully bracketed, for example:<br />
<pre class="code">if e1 then while e2 do if e3 then s1 else null else s2</pre>
and<br />
<pre class="code">if e1 then while e2 do if e3 then s1 else s2 else null</pre>
<br />
<h4>
Require bracketing</h4>
Ultimately, languages that have compound statements where the bodies are “trailing statements” leave too many formatting choices and look ugly and unbalanced to many people. Defining a language’s syntax to <strong>require</strong> bracketed compound statements is a Good Thing for that reason, and as a bonus it <em>removes the dangling else problem completely</em>. The idea is that a block <strong>should not be a kind of statement</strong>, and that all compound statements use blocks for bodies, never simple statements! This grammar fragment:<br />
<pre class="grammar">Stmt = Assignment | IfStmt | WhileStmt | ...
IfStmt = "if" "(" Exp ")" Block ("else" Block)?
WhileStmt = "while" "(" Exp ")" Block
Block = "{" Stmt* "}"</pre>
yields code like:<br />
<table border="0" style="width: 100%px;"><tbody>
<tr><td width="50%"><pre class="code">if (cond1) {
if (cond2) {
stmts1
} else {
stmts2
}
}</pre>
</td><td><pre class="code">if (cond1) {
if (cond2) {
stmts1
}
} else {
stmts2
}</pre>
</td></tr>
</tbody></table>
Bracketing can also be done with just a terminating <span style="font-family: "courier new" , "courier" , monospace;">end</span> instead of curly braces or <span style="font-family: "courier new" , "courier" , monospace;">begin</span>-<span style="font-family: "courier new" , "courier" , monospace;">end</span> pairs:<br />
<pre class="grammar">Stmt = Assignment | IfStmt | WhileStmt | ...
IfStmt = "if" Exp "then" Stmt+ ("else" STMT+)? end
WhileStmt = "while" Exp "do" Stmt+ "end"</pre>
yielding rather clean code like this:<br />
<table border="0" style="width: 100%px;"><tbody>
<tr><td width="50%"><pre class="code">if cond1 then
if cond2 then
stmts1
else
stmts2
end
end</pre>
</td><td><pre class="code">if cond1 then
if cond2 then
stmts1
end
else
stmts2
end</pre>
</td></tr>
</tbody></table>
But, wait—if we really require bracketing, won’t that make multiway conditionals ugly? Either you start indenting too much or you get a bunch of <span style="font-family: "courier new" , "courier" , monospace;">}</span>s (or <span style="font-family: "courier new" , "courier" , monospace;">end</span>s) at the end. Like this, right?<br />
<table border="0" style="width: 100%px;"><tbody>
<tr><td width="50%"><pre class="code">if cond1 {
stmts1
} else {
if cond2 {
stmts2
} else {
if cond3 {
stmts3
} else {
if cond4 {
stmts4
} else {
stmts5
}
}
}
}
</pre>
</td> <td><pre class="code">if cond1 {
stmts1
} else { if cond2 {
stmts2
} else { if cond3 {
stmts3
} else { if cond4 {
stmts4
} else {
stmts5
}}}}
</pre>
</td> </tr>
</tbody></table>
<h4>
Make indentation matter</h4>
In Python, there is no dangling else problem since the indentation makes things clear:<br />
<table border="0" style="width: 100%px;"><tbody>
<tr> <td width="50%"><pre class="code">if cond1:
if cond2:
stmts1
else
stmts2
</pre>
</td> <td><pre class="code">if cond1:
if cond2:
stmts1
else
stmts2
</pre>
</td> </tr>
</tbody></table>
But, wouldn't required indentation lead to funny looking code in multiway conditionals?<br />
<pre class="code">if cond1:
stmts1
else:
if cond2:
stmts2
else
if cond3:
stmts3
else
if cond4:
stmts4
else
stmts5
</pre>
While the code above is legal Python, real Python programmers use the next solution.<br />
<h4>
Introduce a new keyword to simplify required bracketing or indentation</h4>
In most languages where bracketing (or indentation) is required, the multiway conditional is described syntactically as a single if-statement, usually with the help of a special keyword (called <strong>elsif</strong> in Ada, Ruby, and Perl; <strong>elif</strong> in bash and Python; and <strong>elseif</strong> in PHP. We have both a curly-brace style:<br />
<pre class="grammar">IfStmt = "if" "(" Exp ")" Block
("elsif" "(" Exp ")" Block)*
("else" Block)?
Block = "{" STMT* "}"
</pre>
and terminating-end style:<br />
<pre class="grammar">IfStmt = "if" EXP "then" Stmt +
("elsif" Exp "then" Stmt+)*
("else" Stmt+)?
"end"
</pre>
Code looks like this:<br />
<table border="0" style="width: 100%px;"><tbody>
<tr> <td width="33%"><pre class="code"><span class="comment"># Ruby</span>
if cond1
stmts1
elsif cond2
stmts2
elsif cond3
stmts3
elsif cond4
stmts4
else
stmts5
end
</pre>
</td><td width="33%"><pre class="code"><span class="comment"># Python</span>
if cond1:
stmts1
elif cond2:
stmts2
elif cond3:
stmts3
elif cond4:
stmts4
else
stmts5
</pre>
</td> <td><pre class="code"><span class="comment">// PHP</span>
if (cond1) {
stmts1
} elseif (cond2) {
stmts2
} elseif (cond3) {
stmts3
} elseif (cond4) {
stmts4
} else {
stmts5
}
</pre>
</td></tr>
</tbody></table>
<h4>
Lisp’s COND</h4>
Because Lisp is basically written in abstract syntax trees, you won’t ever have a dangling else, and the multiway conditional is already handled by <span style="font-family: "courier new" , "courier" , monospace;">COND</span>:<br />
<table border="0" style="width: 100%px;"><tbody>
<tr><td><pre class="code">(COND
(condition1 block1)
(condition2 block2)
(condition3 block3)
(condition4 block4)
(T block5))</pre>
Interestingly, Erlang uses the same idea, though with a terminating-<span style="white-space: normal;"><span style="font-family: "courier new" , "courier" , monospace;">end</span></span><span style="font-family: "times"; white-space: normal;"> syntax. There are no explicit </span><span style="white-space: normal;"><span style="font-family: "courier new" , "courier" , monospace;">else</span></span><span style="font-family: "times"; white-space: normal;">s; Erlang’s if is inherently multiway:</span></td></tr>
</tbody></table>
<pre class="code">if
condition1 -> body1;
condition2 -> body2;
condition3 -> body3;
condition4 -> body4;
end.
</pre>
<h4>
Require bracketing but accept lookahead</h4>
There is a way to require bracketing without resorting to special words like <tt>elsif</tt> or <tt>elif</tt> and without ending up with a whole mess of terminators at the end. The solution is amazingly simple:<br />
<pre class="grammar">IfStmt = "if" "(" Exp ")" Block
("else" "if" "(" Exp ")" Block)*
('else' Block)?
Block = "{" Stmt* '}'
</pre>
I haven't seen this before. Why is this not popular? The only thing I can think of is that top-down parsers will need a two-token lookahead when encountering an <span style="font-family: "courier new" , "courier" , monospace;">else</span>. Why should this be a big deal? Just peek ahead to see if the next token is an <span style="font-family: "courier new" , "courier" , monospace;">if</span> (or a <span style="font-family: "courier new" , "courier" , monospace;">{</span>).<br />
How can we do this for a terminating-end syntax? This doesn't work very well:<br />
<pre class="grammar">IfStmt = <s>"if" Exp "then" Stmt+</s>
<s>("else" "if" Exp "then" Stmt+)*</s>
<s>("else" Stmt+)?</s>
<s>"end"</s></pre>
because the <em>amount of required lookahead is infinite</em>. What about rejecting 'if' statements in the final <span style="font-family: "courier new" , "courier" , monospace;">else</span>-part?<br />
<pre class="grammar">Stmt = IfStmt | NonIfStmt
IfStmt = "if" Exp "then" Stmt+
("else" "if" Exp "then" Stmt+)*
("else" NonIfStmt Stmt*)?
"end"</pre>
There we go! As long as <span style="font-family: "courier new" , "courier" , monospace;">NonIfStmt</span> cannot be empty and cannot start with <span style="font-family: "courier new" , "courier" , monospace;">if</span>, we can parse top-down with a lookahead of 2.<br />
<h4>
Novel syntactic forms</h4>
If the objection to <span style="font-family: "courier new" , "courier" , monospace;">elsif</span> and related words is just that they are made up, we could give the <span style="font-family: Courier New, Courier, monospace;">if</span>-statement a make over, using more reasonable words, or symbols, even::<br />
<table border="0" style="width: 100%px;"><tbody>
<tr><td width="33%"><pre class="code">when cond1 do
stmts1
or when cond2 do
stmts2
or when cond3 do
stmts3
or when cond4 do
stmts
or else do
stmts
end
</pre>
</td><td width="33%"><pre class="code">try cond1 do
stmts1
or cond2 do
stmts2
or cond3 do
stmts3
or cond4 do
stmts
otherwise
stmts
end
</pre>
</td><td><pre class="code">try [ cond1 ] =>
stmts1
[ cond2 ] =>
stmts2
[ cond3 ] =>
stmts3
[ cond4 ] =>
stmts
[] =>
stmts
end
</pre>
</td></tr>
</tbody></table>
Ray Toalhttp://www.blogger.com/profile/11678817534637339600noreply@blogger.com0