Mike Schaeffer's Blog

Articles with tag: clojure

January 28, 2014

Clojure: lazy-seq and the StackOverflowException

I’ll get back to rpncalc shortly, but before I do, I wanted to take a post to talk about a surprising problem I recently had with lazy sequences. As part of my day job, I am developing a Clojure based system for accumulating and displaying time series data on a web page. One of the core algorithms in my implementation is a incremental merge sort. I have a function that takes two seq’s, both ordered by time, and produces a lazy result seq with all values from both inputs, also in time order. Every few seconds, as new input values are read from their sources, the program uses the ordered merge function to integrate the new values into a seq that contains a complete history of all values. It’s a straightforward and flexible design, and initially, it appeared to work quite well. The problems only started to arise after several hours of run time: traversing the history list would then immediately result in stack overflow exceptions.

If you’re familiar with lazy sequences, this may seem like an odd result. After all, one of the benefits of lazy sequences (aside from their laziness) is that they can eliminate recursion and reduce pressure on the stack. Lazy sequences might require more heap allocation, but they shouldn’t require all that much stack. To explore this idea a bit further, I’m going to use a simpler example than my merge function, I’m going to start with a recursive version of map:

(defn my-map [ fn xs ]
  (if (empty? xs)
     ()
     (cons (fn (first xs)) (my-map fn (rest xs)))))

For simple use cases, my-map has the same interface as the built-in function map:

user> (my-map #(+ 1 %) (range 3))
(1 2 3)
user> (map #(+ 1 %) (range 3))
(1 2 3)

The limitations of my-map start to become apparent for larger sequences:

user> (map #(+ 1 %) (range 10000))
(1 2 3 4 5 6 7 8 9 10 ...)
user> (my-map #(+ 1 %) (range 10000))
StackOverflowError   clojure.lang.Numbers.add (Numbers.java:1685)

What’s happening here is that the recursive call to my-map causes the function to allocate a stack frame for each element of the input sequence. The stack has a fixed size limit, so this places a fixed limit on the size of the sequence that my-map can manipulate. Any input sequences beyond that length limit will cause the function to overflow the stack. map gets around this through laziness, which is something that we can also use:

(defn my-map-lazy [ fn xs ]
  (lazy-seq
    (if-let [ xss (seq xs) ]
     (cons (fn (first xs)) (my-map-lazy fn (rest xs)))
     ())))

With this definition, all is right with the world:

user> (my-map-lazy #(+ 1 %) (range 10000))
(1 2 3 4 5 6 7 8 9 10 ...)

While the text of my-map and my-map-lazy is similar, the functions internally are quite different in operation. my-map completely computes the result of the mapping before it returns: it eagerly evaluates and returns the fully calculated result. In contrast, my-map-lazy doesn’t compute any of the mapping before it returns: it lazily defers the calculation until later and returns a promise to compute the result later on. The difference may be more clear, looking at a slightly macro-expanded form of my-map-lazy:

(defn my-map-lazy [ fn xs ]
  (new clojure.lang.LazySeq
     (fn* []
       (if-let [xss (seq xs)]
          (cons (fn (first xs)) (my-map-lazy fn (rest xs)))
         ()))))

The only computations that happen between the entry and exit of my-map-lazy are the allocation of a new lexical closure and then the instantiation of a new instance of LazySeq. While the body my-map-lazy still contains a call to itself, the call doesn’t happen until after my-map-lazy returns and the LazySeq invokes the closure. There is no recursive call, and there is no risk of overflowing the stack. (The traversal state that was stored on the stack in the recursive version is stored on the heap in the lazy version.)

So why was my merge sort overflowing the stack? To see why, I’m going to introduce a new function, using Clojure’s internal map function. This function serves no purpose, other than to introduce a layer of laziness. It is only useful for the purposes of this discussion:

(defn lazify [ xs ]
  (map identity xs))

Because the evaluation of map is lazy, we can predict that what lazify returns is a LazySeq. This turns out to be true:

user> (.getClass [1 2 3 4 5])
clojure.lang.PersistentVector
user> (.getClass (lazify [1 2 3 4 5]))
clojure.lang.LazySeq

Calling lazify on the result of lazify produces another LazySeq, distinct from the first.

user> (def a (lazify [1 2 3 4 5]))
#'user/a
user> (.getClass a)
clojure.lang.LazySeq
user> (def b (lazify a))
#'user/b
user> (.getClass b)
clojure.lang.LazySeq
user> (identical? a b)
false

Due to the way lazify is defined, the results of the sequences a and b are identical to each other - they both result in (1 2 3 4 5). However, despite the similarity in the results they produce, the two sequences are distinct and produce their results with different code paths. Sequence a computes the identity of each element of the vector [1 2 3 4 5] and sequence b computes the identity of each element of sequence a. Sequence b has to go through sequence a to get the value from the vector that underlies both. Even in lazy sequences, this process is still eager, still recursive, and it still consumes stack.

To confirm this theory, I’ll use another function that applies lazify to a sequence any number of times.

(defn lazify-n [ n seq ]
  (loop [n n seq seq]
    (if (> n 0)
      (recur (- n 1) (lazify seq))
      seq))

This function builds a tower of lazy sequences n sequences tall. Computing even the first element of the result sequence, involves recursively computing every element of each sequence in this tower, down to the original input seq to lazify-n. The depth of the stack required to maintain this recursive stack is proprortional to n. High values of n should produce sequences that can’t be traversed without throwing a stack overflow error. This turns out to be true:

user> (lazify-n 1 [1 2 3 4 5])
(1 2 3 4 5)
user> (lazify-n 4000 [1 2 3 4 5])
StackOverflowError   clojure.core/seq (core.clj:133)

Going back to my original merge sort stack overflow, it is caused by the same issue that we see in lazify-n. The calls to merge two lists don’t merge the lists at the time of the call. Rather, the calls produce promises to merge the lists at some later point in time. Every call to merge increases the number of lists to merge, and increases the depth of the stack that the merge process needs to use during the merge operation. After a while, the number of lists to be merged gets high enough that they can’t be merged without overflowing the stack. This is the cause of my initial stack overflow.

So what’s the solution? One easy solution is to give up some amount of laziness.

(defn lazify-n! [ n seq ]
  (loop [n n seq seq]
    (if (> n 0)
      (recur (- n 1) (doall (lazify seq)))
      seq))

The only difference between this new version of lazify-n and the previous is the call to doall on the fourth line. What doall does is force the full evaluation of a lazy sequence. So, while lazify-n! still produces an n high tower of lazy sequences, they’re all been fully traversed. Because LazySeq caches values the first time it’s traversed traversal, there’s no need to recursively call up the tower of sequences to traverse the final output sequence. This gives up some laziness, but it avoids both stack overflow issues we’ve discussed in this blog post: the overflow on long input sequence lengths and the overflow on deeply nested lazy sequences. The cost (there’s always a cost) is that this requires more heap storage than many alternative structures.

Tags:clojureksm

August 19, 2013

Clojure closures in Java

As part of a team conversation this morning, I worked up a quick Java translation of some more-interesting-than-it-looks Clojure code. It’s a good example of how lexical closures map to objects.

The code we started out with was this:

(defn make-adder [x]
  (let [y x]
    (fn [z] (+ y z))))
 
(def add2 (make-adder 2))
 
(add2 4)

What this code does is define and return a new type of function that adds values to the constant x. In Java, it looks like this:

// Not needed in Clojure... the Clojure runtime implicitly gives types
// that look similar to this to all Clojure functions.
interface Function
{
   Integer invoke(Integer z);
}
 
public class Foo
{
   // (defn make-adder [x]
   //   (let [y x]
   //     (fn [z] (+ y z))))
   public static Function makeAdder(Integer x)
   {
       final Integer y = x;
 
       return new Function() {
           public Integer invoke(Integer z) {
               return z + y;
           }
       }
   }
 
   public static int main(String[] args)
   {
       // (def add2 (make-adder 2))
       Function add2 = Foo.makeAdder(2);
 
       // (add2 4)
       System.out.println(add2.invoke(4));
   }
}

Five lines of Clojure code translate into 30 lines of Java: a function definition becomes a class definition, with state.

This idiom is powerful, but coming from Java, the power is hidden behind unusual syntax and terse notation. There are good reasons for both the syntax and the notation, but if you’re not used to either, it’s very easy to look at a page of Clojure code and be completely lost. Getting past this barrier by developing an intuitive feel for the language is a major challenge faced by teams transitioning to Clojure and Scala. One of the goals I have for my posts in this blog is help fellow developers along the way. It should be a fun ride.

Tags:clojurejavaksm

June 29, 2011

Symbol Identity in Lisp and Clojure

One typical property of Lisp systems is that they they intern symbols. Roughly speaking, when symbols are interned, two symbols with the same print name will also have the same identity. This design choice has several significant implications elsewhere in the Lisp implementation. It is also one of the places where Clojure differs from Lisp tradition.

In code, the most basic version of the intern algorithm is easy to express:

(define (intern! symbol-name)
   (unless (hash-has? *symbol-table* symbol-name)
      (hash-set! *symbol-table* symbol-name (make-symbol symbol-name)))
   (hash-ref *symbol-table* symbol-name))

This code returns the symbol with the given name in the global symbol table. If there's not already a symbol under that name in the global table, it creates a symbol with that name and stores it in the hash prior to returning it.. This ensures that make-symbol is only called once for each symbol-name, and the symbol stored in *symbol-table* is always the symbol returned for a given name. Any two calls to intern! with the same name are therefore guaranteed to return the exact, same, eq? symbol object. At a vCalc REPL, this looks like so (The fact that both symbols are printed with ##0 implies that they have the same identity.):

user> (intern! "test-symbol")
; ##0 = test-symbol
user> (intern! "test-symbol")
; ##0 = test-symbol

This design has several properties that have historially been useful when implementing a Lisp. First, by sharing the internal representation of symbols with the same print name, interning can reduce memory consumption. A careful programmer can write an implementation of interned symbols that doesn't allocate any memory on the heap unless it sees a new, distinct symbol. Interning also gives a (theoretically) cheaper mechanism for comparing two symbols for equality. Enforcing symbol identity equality for symbol name equality implies that symbol name equality can be reduced to a single machine instruction. In the early days of Lisp, these were very significant advantages. With modern hardware, they are less important. However, the semantics of interned symbols do still differ in important ways.

One example of this is that interned symbols make it easy to provide a global environment 'for free'. To see what I mean by this, here is the vCalc declaration of a symbol:

struct
{
     ...
     lref_t vcell;  // Current Global Variable Binding
     ...
} symbol;

Each symbol carries with it three fields that are specific to each symbol, and are created and initialized at the time the symbol is created. Because vcell for the symbol is created at the same time as the symbol, the global variable named by the symbol is created at the same time as the symbol itself. Accessing the value of that global variable is done through a field stored at an offset relative to the beginning of the symbol. These benefits also accrue to property lists, as they can also be stored in a field of a symbol. This is a cheap implementation strategy for global variables and property lists, but it comes at the cost of imposing a tight coupling between two distinct concepts: symbols and the global environment.

The upside of this coupling is that it encourages the use of global symbol attributes (bindings and properties). During interactive programming at a REPL, global bindings turn out to be useful because they make it easy to 'say the name' of the bindings to the environment. For bindingsthat directly map to symbols, the symbol itself is sufficient to name the binding and use it during debugging. Consider this definition:

(define *current-counter-value* 0)

(define (next-counter-value)
   (incr! *current-counter-value*)
   *current-counter-value*)

This definition of next-counter-value makes it easy to inspect the current counter value. It's stored in a global variable binding, so it can be inspected and modified during debugging using its name: *current-counter-value*. A more modular style of programming would store the current counter value in a binding local to the definition of next-counter-value:

(let ((current-counter-value 0))
  (define (next-counter-value)
    (incr! current-counter-value)
    current-counter-value))

This is 'better' from a stylistic point of view, because it reduces the scope of the binding of current-counter-value entirely to the scope of the next-counter-value function. This eliminates the chance that 'somebody else' will break encapsulation and modify the counter value in a harmful fashion. Unfortunately, 'somebody else' also includes the REPL itself. The 'better' design imposes the cost that it's no longer as easy to inspect or modify the current-counter-value from the REPL. (Doing so requires the ability to inspect or name the local bindings captured in the next-counter-value closure.)

The tight coupling between interned symbols and global variable bindings should not come as a suprise, because interning a symbol necessarily makes the symbol itself global. In a Lisp that interns symbols, the following code fragment creates two distinct local variable bindings, despite the fact that the bindings are named by the same, eq? symbol: local-variable.

(let ((local-variable 0))
   (let ((local-variable 0))
      local-variable))

The mismatch between globally interned symbols and local bindings implies that symbols cannot as directly be involved in talking about local bindings. A Common Lisp type declaration is an S-expression that says something about the variable named by a symbol.

(declare (fixnum el))

In contrast, a Clojure type declaration is a reader expression that attaches metadata to the symbol itself:

^String x

The ^ syntax in Clojure gathers up metadata and then applies it using withMeta to the next expression in the input stream. In the case of a type declaration, the metadata gets applied to the symbol naming the binding. This can be done in one of two ways. The first is to destructively update metadata attached to an interned symbol. If Clojure had done this, then each occurrance of symbol metadata would overwrite whatever metadata was there before, and that one copy of the metadata would apply to every occurance of the symbol in the source text. Every variable with the same name would have to have the same type declarations.

Clojure took the other approach, and avoids the problem by not interning symbols. This allows metadata to be bound to a symbol locally. In the Clojure equivalent of the local-variable example, there are two local variables and each are named by two distinct symbols (but with the same name).

(let [local-variable 0]
   (let [local-variable 0]
      local-variable))

The flexibility of this approch is useful, but it comes at the cost of losing the ability to store values in the symbols themselves. Global symbol property lists in Lisp have to be modeled using some other means. Global variable bindings are not stored in symbols, but rather in Vars. (This is something that compiled Lisps tend to do anyway.) These changes result in symbols in Clojure becoming slightly 'smaller' than in Lisp, and more well aligned with how they are used in moodern, lexically scoped Lisps. Because there are still global variable bindings availble in the language, the naming benefits of globals are still available for use in the REPL. It's taken a while for me to get there, but the overall effect of un-interned symbols on the design of a Lisp seems generally positive.

Tags:clojurelisptech