Mike Schaeffer's Blog

Articles with tag: lisp
June 5, 2023

If you've been around programming for a while, you've no doubt come across the Lisp family of languages. One of the oldest languages still in use, Lisp has contributed much to the profession, but it's probably most infamous for the "S-expression". S-expressions are a text-based serialization of the language's core data structures. Since Lisp is written in terms of those same data structures, the S-expression is also the syntax of the language.

To give you a taste if you're not familiar, here's a simple Clojure function for parsing a string identifier. The identifier is passed in as either a numeric string (123) or a hash ID (tlXyzzy), and the function parses either form into a number.

(defn decode-list-id [ list-id ]
  (or (try-parse-integer list-id)
      (hashid/decode :tl list-id)))

In a "C-Like Langauge", the same logic looks more or less like this:

function decodeListId(listId) {
    return tryParseInteger(listId) || hashid::decode("tl", listId);
}

Right off the bat, you'll notice a heavy reliance on parentheses to delimit logical blocks. With the exception of the argument list ([ list-id ]), every logical grouping in the code is delimited by parentheses. You'll also notice that the variable name (list-id) contains a hyphen - not allowed in C-like languages. I could point out more, but even stopping there, it's clear that Lisp syntax is unusual to modern eyes.

What may be even more unusual about this syntax is the fact that some people like it. I count myself among them. It's strange, but there are reasons for the strangeness. The strangeness, while it imposes costs, also offers benefits. It's these benefits that I wish to discuss.

Before I continue, I'd like to first credit Fernando Borretti's recent post on Lisp syntax. It's always good to see a defense of Lisp syntax, and I think his article nicely illustrates the way the syntax of the language supports one of Lisp's other hallmark features: macros. If you haven't already read it, you should click that link and read it now. That said, there's more to the story, which is why I'm writing something myself.

If you've studied compilers, it's probably struck you how much of the first part of the course is spent on various aspects of language parsing. You'll study lexical analysis, which lets you divide streams of characters into tokens. Once you understand the basics of lexical analysis, you'll then study how to fold linear sequences of tokens into trees according to a grammar. Then come a few more tree transformations, and finally linearization back to a sequence of instructions for some more primitive machine. Lisp's syntax concerns the first two steps of this - lexical and syntactic analysis.

Lexical analysis for Lisp is very similar to lexical analysis for other languages; the main difference is that the token rules are a bit different. Lisp allows hyphens in symbols (see above), and most other languages do not. This changes how the language looks, but it isn't a huge structural advantage of Lisp's syntax:

(defn decodeListId [ listId ]
  (or (tryParseInteger listId)
      (hashid/decode :tl listId)))

Where things get interesting for Lisp is in the syntactic analysis stage - the folding of linear lists of tokens into trees. One of the first parsing techniques you might learn while studying compilers is known as predictive recursive descent, specifically for LL(1) grammars. Without going into details, these are simple parsers to write by hand. The grammar of an LL(1) language can be mapped directly to collections of functions. Then, if there's a choice to be made during parsing, it can always be resolved by looking a single token ahead to predict the next rule you need to follow. These parsers have many limitations in what they can parse (no infix expressions), but they can parse quite a bit, and they're easy to write.

Do you see where this is going? Lisp falls into the category of languages that can easily be parsed with a recursive descent parser. Another way to put it is that it doesn't take a lot of sophistication to impart structure on a sequence of characters representing a Lisp program. While it may be hard to write a C++ parser, it's comparatively easy to write one for Lisp. Thanks to the simple nature of Lisp's grammar, the language really wears its syntax tree on its sleeve. This is, and has long been, one of the key advantages Lisp derives from its syntax.
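To make that concrete, here's a minimal sketch in Clojure of a recursive descent reader for a toy subset of S-expressions (just symbols, integers, and nested lists). This is my own illustration rather than Clojure's actual reader, but it shows how little machinery is required:

(require '[clojure.string :as str])

;; Lexical analysis: split the input into parentheses and atom tokens.
(defn tokenize [s]
  (->> (-> s
           (str/replace "(" " ( ")
           (str/replace ")" " ) ")
           (str/split #"\s+"))
       (remove str/blank?)))

;; Syntactic analysis: one token of lookahead decides which rule applies.
;; Returns a pair of [parsed-form remaining-tokens].
(defn read-form [[token & more]]
  (cond
    (nil? token)  (throw (ex-info "Unexpected end of input" {}))
    (= token ")") (throw (ex-info "Unexpected )" {}))
    (= token "(") (loop [forms [] tokens more]
                    (if (= (first tokens) ")")
                      [(apply list forms) (rest tokens)]
                      (let [[form tokens] (read-form tokens)]
                        (recur (conj forms form) tokens))))
    (re-matches #"-?\d+" token) [(Long/parseLong token) more]
    :else [(symbol token) more]))

(first (read-form (tokenize "(or (try-parse-integer list-id) (fallback list-id))")))
;; => (or (try-parse-integer list-id) (fallback list-id))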

The first advantage is that simple parsing makes for simple tooling. If it's easier to write a parser for a language, it's easier to write external tools that understand that language in terms of its syntax. Emacs' paredit-mode is a good example of this. paredit-mode offers commands for interacting with Lisp code at the level of its syntactic structure. It lets you cut text based on subexpressions, swap subexpressions around, and perform similar operations based on the structure of the language. It's easier to write tools that operate on a language like this when the syntax is easily parsed. To see what I mean, imagine a form of paredit-mode for C++ and think about how hard it would be to cut a subexpression there. What sorts of parsing capabilities would that command require, and how would it handle the case where the code in the editor is only partially correct?

This is also true for human users of this sort of tooling. Lisp's simple grammar lets it wear its structure on its sleeve not just for automatic tools, but also for the people using those tools. The properties of Lisp that make it easy for tools to identify a specific subexpression also make it easier for human readers of a block of code to identify that same subexpression. To put it in terms of paredit-mode, it's easier for readers to predict what that mode's commands will do, since the syntactic structure of the language is so much more evident.

A side benefit of a simple grammar is that simpler grammars are more easily extended. Fernando Borretti speaks to the power of Lisp macros in his article, but Common Lisp also offers reader macros. A reader macro is bound to a character or sequence of characters, and receives control when the standard Lisp reader encounters that sequence. The standard reader passes in the input stream and lets the reader macro function do what it wants, returning a Lisp value reflecting the content of what it read. This can be used to do things like add support for XML literals or infix expressions.

If the implications are not totally clear, Lisp's syntactic design is arguably easier for tools, and it also allows easier extension to completely different syntaxes. The only constraint is that the reader macro has to accept its input as a Lisp input stream, process it somehow with Lisp code, and then return the value it "read" as a single Lisp value. It's very capable, and it fits naturally into the simple lexical and syntactic structure of a Lisp. Infix languages have tried to be this extensible, but have largely failed, due to the complexity of the task.

Of course, the power of Lisp reader macros is also their weakness. By operating at the level of character streams (rather than Lisp data values), they make it impossible for external tools to fully parse Common Lisp source text. As soon as a reader macro becomes involved, there's the possibility of character sequences in the source text that are entirely outside the realm of a standard S-expression. This is like JSX embedded in JavaScript or SQL embedded in C - blocks of text that are totally foreign to the dominant language of the source file. While it's possible to add special cases for specific reader macros, it's not possible to do so in general. The first reader macro you write will break your external tools' ability to reason about the code that uses it.

This problem provides a great example of where Clojure deviates from the Common Lisp tradition. Rather than providing full reader macros, Clojure offers tagged literals. Unlike a reader macro, a tagged literal never gets control over the reader's input stream. Rather, it gets an opportunity at read time to process a value that has already been read by the standard reader. This means a tagged literal processes data very early in the compilation process, but it doesn't have the freedom to deviate from the standard syntax of a Clojure S-expression. The result is both flexibility to customize the reader and the ability for external tools to fully understand, ahead of time, the syntax of a Clojure source file, regardless of whether it uses tagged literals. Whether or not this is a good trade-off might be a matter of debate, but it's a debate in the context of a form of customization that most languages don't offer at all.
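To make that concrete, here's a small example using clojure.edn. The my/point tag and read-point function are hypothetical names of my own choosing; the point is that the reader function receives a value that has already been read as ordinary Clojure data:

(require '[clojure.edn :as edn])

(defrecord Point [x y])

;; The tagged literal #my/point [3 4] is read by the standard reader as the
;; tag my/point followed by a plain vector; read-point then receives that vector.
(defn read-point [[x y]]
  (->Point x y))

(edn/read-string {:readers {'my/point read-point}}
                 "{:origin #my/point [0 0], :corner #my/point [3 4]}")
;; => {:origin #user.Point{:x 0, :y 0}, :corner #user.Point{:x 3, :y 4}}
;;    (record printing shown as it appears in the user namespace)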

To be clear, there's more to the story. As Fernando Borretti mentions in his article, Lisp's uniform syntax extends across the language. A macro invocation looks the same as a special form, a function call, or a defstruct. Disambiguating between the various semantics of a Lisp form requires you to understand the context of the form and how symbols within that form are bound to meanings within that context. Put more simply, a function call and a macro invocation can look the same, even though they may have totally different meanings. This is a problem, and it's one that arises directly from the simplicity of Lisp syntax I extol above. I don't have a solution to it, other than to observe that if you're going to adopt Lisp syntax and face its problems, you'd do well to fully understand and use its benefits as compensation. Everything in engineering, as in life, is a tradeoff.

It's that last observation that's my main point. We live in a world where the mathematical tradition has, for centuries, been infix expressions. This has carried through to programming, which has also significantly converged on C-like syntax for its dominant languages. Lisp stands against both of these traditions in its choice of prefix expressions written in a simpler grammar than the norm. There are costs to this choice, and those costs tend to be immediately obvious. There are also benefits, and the benefits take time to make themselves known. If you have that time, it can be a rewarding thing to explore the possibilities, even if you never get the chance to use them directly in production.

March 26, 2014

Update 2019-01-17: KSM recently redesigned their website in a way that removes the original blog. Because of this, I've taken some of what I wrote then for KSM and re-hosted it here. Thanks are due both to KSM Technology Partners for allowing me to do this and to the Wayback Machine for retaining the content. All the links below are updated to reflect the articles' new locations.


Sorry for the radio silence, but recently I've been focusing my writing time on the KSM Technology Partners Blog. My writing there is still technical in nature, but it tends to be more heavily focused on the JVM. If you're interested, here are a few of what I consider to be the highlights.

In mid-2013, I started out writing about how to use Runnable to explicitly enforce dynamic extent in Java. In a nutshell, this is a way to implement try-with-resources in versions of Java that don't have it built into the language. I then used the dynamic extent technique to build a ThreadLocal that plays nicely with thread pools. This is useful because ThreadLocal values depend on which thread you're running on - exactly the detail that thread pooling abstracts away.
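Those posts are written in Java, but the underlying idea - tying a resource's lifetime to the dynamic extent of a callback - is easy to sketch. Here's a rough Clojure analog (with-resource is my own illustrative helper, not code from the KSM posts):

;; The resource is acquired before body-fn runs and released when body-fn
;; exits, whether normally or via an exception. That execution window is
;; the resource's dynamic extent.
(defn with-resource [acquire release body-fn]
  (let [resource (acquire)]
    (try
      (body-fn resource)
      (finally
        (release resource)))))

;; Hypothetical usage:
(with-resource #(java.io.StringReader. "hello")
               #(.close ^java.io.Closeable %)
               slurp)
;; => "hello"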

Later in the year, I focused more on Clojure, starting off with a quick bit on the relationship of lexical closures to Java inner classes. I also wrote about a particular kind of stack overflow exception that can happen with lazy sequences. Lazy sequences can nicely remove the need to use recursion while traversing their length, but each time two unrealized lazy sequences are combined, it adds to the recursive depth required to compute the first element. For me, this stack overflow was a difficult error to diagnose, because it seemed so counter-intuitive.
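Here's a minimal reproduction of that failure mode (my own example, not the code from the KSM post):

;; Each lazy concat wraps the previous one without realizing anything, so
;; this line runs fine and just builds 100,000 levels of nesting.
(def deeply-nested
  (reduce concat (repeat 100000 [1])))

;; The overflow only happens when the first element is finally demanded,
;; far away from the code that built the sequence.
(first deeply-nested)
;; => throws java.lang.StackOverflowError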

I'm also in the middle of a series of posts that relate the GoF command pattern to functional programming. The posts start off with Java, but will ultimately describe a Clojure implementation that compiles a stack based expression language into optimized Java bytecode. If you'd like to play with the code, it's on github.
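The gist of the functional take is easy to sketch: each command is just a function from stack to stack, and a program is a sequence of those functions. (This is my own toy illustration; the series itself goes further and emits JVM bytecode.)

;; Each command builder returns a function of the evaluation stack.
(def commands
  {:push (fn [n] (fn [stack] (conj stack n)))
   :add  (fn []  (fn [[a b & more]] (conj more (+ a b))))
   :mul  (fn []  (fn [[a b & more]] (conj more (* a b))))})

(defn compile-program [instructions]
  (mapv (fn [[op & args]] (apply (commands op) args)) instructions))

(defn run [program]
  (peek (reduce (fn [stack command] (command stack)) '() program)))

(run (compile-program [[:push 3] [:push 4] [:add] [:push 2] [:mul]]))
;; => 14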

May 30, 2012

In my Lisp programming, I find myself using Anaphoric Macros quite a bit. My first exposure to this type of macro (and to deliberate variable capture) was in Paul Graham's On Lisp. Since I haven't been able to find Emacs Lisp implementations of these macros, I wrote my own.

The first of the two macros is an anaphoric version of the standard if special form:

(defmacro aif (test if-expr &optional else-expr)
  "An anaphoric variant of (if ...). The value of the test
expression is locally bound to 'it' during execution of the
consequent clauses. The binding is present in both consequent
branches."
  (declare (indent 1))
  `(let ((it ,test))
     (if it ,if-expr ,else-expr)))

The second macro is an anaphoric version of while:

(defmacro awhile (test &rest body)
  "An anaphoric varient of (while ...). The value of the test
expression is locally bound to 'it' during execution of the body
of the loop."
  (declare (indent 1))
  (let ((escape (gensym "awhile-escape-")))
    `(catch ',escape
       (while t
         (let ((it ,test))
           (if it
               (progn ,@body)
             (throw ',escape ())))))))

What both of these macros have in common is that they emulate an existing conditional special form, while adding a local binding that makes it possible to access the result of the condition. This is particularly useful in scenarios where a predicate function returns a true value that contains useful information beyond t or nil.
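Clojure doesn't ship an aif, but its built-in if-let captures much the same idea, binding the value of the test for use in the consequent. A small illustrative example (first-number is my own, hypothetical function name):

;; re-find returns the matched text rather than just a boolean, and if-let
;; makes that useful value available by name in the "then" branch.
(defn first-number [s]
  (if-let [digits (re-find #"\d+" s)]
    (Long/parseLong digits)
    :no-number-found))

(first-number "order-1234")   ;; => 1234
(first-number "no digits")    ;; => :no-number-found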

June 29, 2011

One typical property of Lisp systems is that they intern symbols. Roughly speaking, when symbols are interned, two symbols with the same print name will also have the same identity. This design choice has several significant implications elsewhere in a Lisp implementation. It is also one of the places where Clojure differs from Lisp tradition.

In code, the most basic version of the intern algorithm is easy to express:

(define (intern! symbol-name)
   (unless (hash-has? *symbol-table* symbol-name)
      (hash-set! *symbol-table* symbol-name (make-symbol symbol-name)))
   (hash-ref *symbol-table* symbol-name))

This code returns the symbol with the given name in the global symbol table. If there's not already a symbol under that name in the global table, it creates one and stores it in the hash prior to returning it. This ensures that make-symbol is called only once for each symbol-name, and that the symbol stored in *symbol-table* is always the symbol returned for a given name. Any two calls to intern! with the same name are therefore guaranteed to return the exact same, eq? symbol object. At a vCalc REPL, this looks like so (the fact that both symbols are printed with ##0 implies that they have the same identity):

user> (intern! "test-symbol")
; ##0 = test-symbol
user> (intern! "test-symbol")
; ##0 = test-symbol

This design has several properties that have historically been useful when implementing a Lisp. First, by sharing the internal representation of symbols with the same print name, interning can reduce memory consumption. A careful programmer can write an implementation of interned symbols that doesn't allocate any memory on the heap unless it sees a new, distinct symbol. Interning also gives a (theoretically) cheaper mechanism for comparing two symbols for equality: because name equality implies identity equality, comparing symbol names can be reduced to a pointer comparison - a single machine instruction. In the early days of Lisp, these were very significant advantages. With modern hardware, they are less important. However, the semantics of interned symbols still differ in important ways.

One example of this is that interned symbols make it easy to provide a global environment 'for free'. To see what I mean by this, here is the vCalc declaration of a symbol:

struct
{
     ...
     lref_t vcell;  // Current Global Variable Binding
     ...
} symbol;

Each symbol carries with it three fields of its own, created and initialized at the time the symbol itself is created. Because the symbol's vcell is created along with the symbol, the global variable named by the symbol is created at the same time as the symbol itself. Accessing the value of that global variable is done through a field stored at a fixed offset from the beginning of the symbol. The same benefit accrues to property lists, since they can also be stored in a field of the symbol. This is a cheap implementation strategy for global variables and property lists, but it comes at the cost of a tight coupling between two distinct concepts: symbols and the global environment.

The upside of this coupling is that it encourages the use of global symbol attributes (bindings and properties). During interactive programming at a REPL, global bindings turn out to be useful because they make it easy to 'say the name' of the binding to the environment. For bindings that map directly to symbols, the symbol itself is sufficient to name the binding and use it during debugging. Consider this definition:

(define *current-counter-value* 0)

(define (next-counter-value)
   (incr! *current-counter-value*)
   *current-counter-value*)

This definition of next-counter-value makes it easy to inspect the current counter value. It's stored in a global variable binding, so it can be inspected and modified during debugging using its name: *current-counter-value*. A more modular style of programming would store the current counter value in a binding local to the definition of next-counter-value:

(let ((current-counter-value 0))
  (define (next-counter-value)
    (incr! current-counter-value)
    current-counter-value))

This is 'better' from a stylistic point of view, because it reduces the scope of the binding of current-counter-value entirely to the scope of the next-counter-value function. This eliminates the chance that 'somebody else' will break encapsulation and modify the counter value in a harmful fashion. Unfortunately, 'somebody else' also includes the REPL itself. The 'better' design imposes the cost that it's no longer as easy to inspect or modify the current-counter-value from the REPL. (Doing so requires the ability to inspect or name the local bindings captured in the next-counter-value closure.)

The tight coupling between interned symbols and global variable bindings should not come as a surprise, because interning a symbol necessarily makes the symbol itself global. In a Lisp that interns symbols, the following code fragment creates two distinct local variable bindings, despite the fact that both bindings are named by the same, eq? symbol: local-variable.

(let ((local-variable 0))
   (let ((local-variable 0))
      local-variable))

The mismatch between globally interned symbols and local bindings implies that symbols cannot be as directly involved in talking about local bindings. A Common Lisp type declaration is an S-expression that says something about the variable named by a symbol:

(declare (fixnum el))

In contrast, a Clojure type declaration is a reader expression that attaches metadata to the symbol itself:

^String x

The ^ syntax in Clojure gathers up metadata and then applies it using withMeta to the next expression in the input stream. In the case of a type declaration, the metadata gets applied to the symbol naming the binding. This could be done in one of two ways. The first would be to destructively update metadata attached to an interned symbol. If Clojure had done this, then each occurrence of symbol metadata would overwrite whatever metadata was there before, and that one copy of the metadata would apply to every occurrence of the symbol in the source text. Every variable with the same name would have to have the same type declarations.

Clojure took the other approach, and avoids the problem by not interning symbols. This allows metadata to be bound to a symbol locally. In the Clojure equivalent of the local-variable example, there are two local variables, each named by a distinct symbol (though the two symbols share the same name).

(let [local-variable 0]
   (let [local-variable 0]
      local-variable))
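Both points are easy to see at a Clojure REPL: two occurrences of the same symbol name are equal but not identical, and read-time metadata attaches to the individual occurrence rather than to a shared, interned object:

user=> (= 'local-variable 'local-variable)
true
user=> (identical? 'local-variable 'local-variable)
false
user=> (meta (read-string "^String local-variable"))
{:tag String}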

The flexibility of this approach is useful, but it comes at the cost of losing the ability to store values in the symbols themselves. Global symbol property lists in Lisp have to be modeled using some other means. Global variable bindings are not stored in symbols, but rather in Vars. (This is something that compiled Lisps tend to do anyway.) These changes result in symbols in Clojure being slightly 'smaller' than in Lisp, and more closely aligned with how they are used in modern, lexically scoped Lisps. Because global variable bindings are still available in the language, the naming benefits of globals remain available for use at the REPL. It's taken a while for me to get there, but the overall effect of un-interned symbols on the design of a Lisp seems generally positive.
