Mike Schaeffer's Blog

January 9, 2008

In my career, I've done a bit of switching back and forth between Emacs and various IDE's. One of the IDE features I've come to depend on is quick access to the compiler. Typically, IDE's make it possible to compile your project with a keystroke, and then navigate from error to error at the press of a key. It's easy to recreate this in Emacs. The following two expressions make Emacs work a lot like Visual Studio in this regard.

(global-set-key [(shift f5)] 'compile)
(global-set-key [f12] 'next-error)

After these forms are evaluated, pressing Shift-F5 invokes the compile command, which asks for a command to be run in an inferior shell, typically make, ant, or some other build utility. The catch is that it runs the command in the directory of the current buffer, which implies that the build script can be found in the same directory as the current source file. For a Java project with a per-package directory hierarchy, this is often not true. There are probably a bunch of ways to fix this, but I've solved it with a Windows NT batch file, ant-up.bat, that repeatedly searches up the directory hierarchy for build.xml. I just compile with ant-up, rather than a direct invocation of ant. This is not the most elegant solution, I'm sure, but it took about five minutes to implement and works well.

@echo off
setlocal

:retry
rem Remember where we are, so we can tell when "cd .." stops making progress.
set last_path=%CD%
echo Searching %CD% ...
if exist build.xml goto compile-now
cd ..
rem If the directory didn't change, we've hit the root without finding build.xml.
if "%last_path%"=="%CD%" goto abort
goto retry

:compile-now
call ant -k %1 %2 %3 %4 %5
if errorlevel 1 goto failure
goto success

:abort
echo build.xml not found... compile failed

:failure
exit /b 1

:success
exit /b 0
January 8, 2008

Lately, I've been thinking a bit about the way language design influences library design. My line of thought started out inspired by some of the recent conversations about closures in Java, but it ended up also touching on dynamic typing and a few other 'modern' language features. This will end up being more than one post, but I thought I'd record some of it in my blog, with the hope that it might shed some light for somebody, somewhere.

To motivate this discussion, I'll use as an example a simple C implementation of a string-interning function, intern_string. If you're not familiar with the concept of interning, the premise is that interning two objects ensures that if they have the same value, they also have the same identity. In the case of C strings, interning ensures that if strcmp(intern_string(a), intern_string(b)) == 0 holds true, then intern_string(a) == intern_string(b) also holds true. Since it effectively means that each string value is only stored one time, this technique can reduce memory requirements. It also gives you a cheap string equality comparison: checking two interned strings for equality reduces to a pointer comparison, which is about as fast as it gets.
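To make that invariant concrete, here's a minimal sketch of what a caller of intern_string (implemented below) should be able to rely on; the example function and string literals are purely illustrative:

#include <assert.h>
#include <string.h>

void interning_example()
{
  char buffer[] = "example";       /* a second, distinct copy of "example" */

  char *a = intern_string("example");
  char *b = intern_string(buffer);

  assert(strcmp(a, b) == 0);       /* equal by value... */
  assert(a == b);                  /* ...and therefore equal by identity */
}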

Given a hash table that compares keys by value, implementing the function intern_string is fairly simple. In the following code, intern_table is a hash table that maps strings to themselves. hash_ref, hash_set, and hash_has are functions that manipulate the hash table:

  • int hash_has(hash_table_t ht, char *key) - Returns TRUE or FALSE, depending on whether or not the key key is found in the hash table ht.
  • char *hash_ref(hash_table_t ht, char *key) - Returns the value bound to the key key by the hash table ht. If the key is not found, behavior is undefined.
  • char *hash_set(hash_table_t ht, char *key, char *value) - Binds the value value to the key key in the hash table ht. If the key is already present, the existing value is overwritten. This function returns value.

Note the critical assumption that the hash_* accessors compare keys by value semantics (strcmp) rather than identity semantics (==).

hash_table_t intern_table; // assume this is initialized somewhere else.

char *intern_string(char *str)
{
  if (hash_has(intern_table, str))
     return hash_ref(intern_table, str);

  char *interned_str = strdup(str);

  hash_set(intern_table, interned_str, interned_str);

  return interned_str;
}

The first step of intern_string is to check whether the intern table already contains a string with the value of the new string. If the new string is already in the intern table, then the function returns the copy that's in the table. Otherwise, a new copy of the incoming string is created and stored in the hash table. In either case, the string returned is in the intern table. This logic ensures that every time intern_string is called with a str of the same value, it returns the exact same string.

If you haven't guessed already, the problem with this implementation of intern_string lies in the dual calls to hash_has and hash_ref. Both calls involve searching the hash table for the key: hash_has to determine if the key exists, and hash_ref to retrieve the key's value. This means that in the common case, interning a string that's already been interned, this implementation searches the hash table twice. Double work.

Fixing this problem involves changing the calling conventions for hash_ref. One of the simplest ways to do this involves defining a specific return value that hash_ref can return in the 'key not found' case. For strings in C, NULL is a logical choice. This change to hash_ref makes it possible to remove the double search by eliminating the explicit hash_has check:

hash_table_t intern_table;

char *intern_string(char *str)
{
  char *interned_str = hash_ref(intern_table, str);

  if (interned_str == NULL)
  {
     interned_str = strdup(str);

     hash_set(intern_table, interned_str, interned_str);
  }

  return interned_str;
}

For string interning, this change to the hash_ref interface works fairly well. We know that we'll never store a hash key with a NULL value, so we know that NULL is safe to use for signaling that a key was not found. Were this ever to change, this version of hash_ref doesn't return enough information to distinguish between the 'key not found' case and the 'NULL value' case. Both would return NULL. To fix this, hash_ref needs to be extended to also return a separate value that indicates whether the key was found. This can be done in C by having hash_ref return the 'key found' flag as its return value, while also accepting a pointer to a buffer that receives the key's value when it is found.
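Just for illustration, the revised accessor might be declared along these lines; this exact prototype is my assumption, since only the original single-value signatures are listed above:

/* Hypothetical revised signature: returns TRUE if the key was found in
   ht and FALSE otherwise. When the key is found, the bound value is
   written into *value; otherwise *value is left untouched. */
int hash_ref(hash_table_t ht, char *key, char **value);

With an accessor shaped like that, intern_string can be written as follows: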

hash_table_t intern_table;

char *intern_string(char *str)
{
  char *interned_str;

  if (!hash_ref(intern_table, str, &interned_str))
  {
     interned_str = strdup(str);

     hash_set(intern_table, interned_str, interned_str);
  }

  return interned_str;
}

This is probably about as good as you can get in straight C. It easily distinguishes between the 'no-value' and 'no-key' cases, it's relatively easy to read, and it uses the common idioms of the language. However, C is a relatively sparse language. If you're willing to switch to something a bit more expressive, you have other choices.

One example of this is an approach that's particularly well supported by dynamically typed languages. In a language that can identify types at runtime, hash_ref can return a value of an entirely different type when the key is not found, and that value can be distinguished from ordinary return values by the language's runtime type identification. In one such language, Scheme, this lets us implement intern-string like this:

(define *intern-table* (make-hash :equal))

(define (intern-string str)
  (let ((interned-str (hash-ref *intern-table* str 42)))
    (cond ((eqv? interned-str 42)
           (hash-set! *intern-table* str str)
           str)
          (#t
           interned-str))))

If you prefer C/JavaScript-style syntax, it looks like this:

var intern_table = make_hash(EQUAL);

function intern_string(str)
{
   var interned_str = hash_ref(intern_table, str, 42);

   if (interned_str == 42)
   {
       hash_set(intern_table, str, str);
       return str;
   }

   return interned_str;
}

In this case, hash_ref has been extended with a third argument: a default value to return if the key is not found. The code above uses this to have hash_ref return a number in the 'no value' case, and it's the type of that return value that guarantees it can't be confused with an interned string. This is a common dynamic language idiom, but for a moment, consider what it would look like in C:

hash_table_t intern_table;

char *intern_string(char *str)
{
  char *interned_str = hash_ref(intern_table, str, (char *)42);

  if (interned_str == (char *)42)
  {
     interned_str = strdup(str);

     hash_set(intern_table, interned_str, interned_str);
  }

  return interned_str;
}

At first, this actually seems like it might be a plausible implementation of intern_string. My guess is that it might even work most of the time. Where this implementation gets into trouble is the case when an interned string might reasonably be located at address 42. Because C is statically typed, when hash_ref returns, all it returns is a pointer. The caller cannot distinguish between the 'valid string at address 42' return value and the 'no-key' return value. This is basically the same problem as the case where we overloaded NULL to signal 'no-key'.

The way the dynamically typed language solved this problem is worth considering. When a dynamically typed language returns a value, what it's really doing is returning a pointer along with a few extra bits describing the type of the object being pointed to. (Runtime implementations vary, but that's the gist of many.) Using dynamic typing to distinguish between the two possible cases really amounts to using those few extra type bits to carry 'another' return value, one holding information on whether or not the key was found. This is exactly what our 'best' C implementation does explicitly with a flag return value and a separate reference parameter. The dynamic typing isn't necessarily adding any expressive power, but it is giving us another, more concise means of expressing what we're trying to say.
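To put that in concrete C terms, imagine a hypothetical hash_ref that handed back a small tagged struct instead of a bare pointer; the tag plays the role of the dynamic language's type bits. This isn't the interface used anywhere above, just a sketch of the idea:

/* A tagged value: a payload pointer plus a few bits describing what it
   points to. This is roughly what a dynamically typed runtime hands back
   from every call. */
typedef enum { VALUE_STRING, VALUE_NUMBER } value_tag_t;

typedef struct {
  value_tag_t tag;   /* the 'extra bits' of type information */
  void *ptr;         /* the payload itself */
} tagged_value_t;

/* With a representation like this, the caller distinguishes the numeric
   'not found' default from a real interned string by checking the tag,
   rather than by comparing the pointer against a magic address. */
int found_interned_string(tagged_value_t v)
{
  return v.tag == VALUE_STRING;
}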

August 14, 2007

For the better part of my career, I've straddled the fence between pure software development roles and consulting roles. My first job after graduating in 1997 was at a hardware firm developing embedded software for process control hardware. Over the three years I was working there, I was on point for everything from register-level device drivers running on the custom hardware to a device configuration GUI running on a PC. By the time I left the firm, I even had the opportunity to write a small programming language, complete with compiler and 'VM'. This was a perfect job for a guy with a freshly-minted Computer Science degree and a hankering to write some cool code. At least it was perfect at first.

What ultimately drove me to leave that role is something that I think most technical jobs, particularly those in product development, have in common: a severe risk of detachment from your clients. Software developers, myself included, tend to be an introverted lot; even if they weren't, it's oftentimes perceived to be in the best interests of a software house to keep software developers on task developing software. In other words, there are both personal and corporate pressures to keep developers hacking away at the code instead of talking to customers. The risk here is that the people who best understand the products are the people that potentially least understand the customers. I can tell you from firsthand experience that, while I knew in detail the sampling characteristics of the device I was building, I had no idea how it might be used in a chemical plant to measure a temperature sensor and control a pump or a heater. It's easy to develop a product in that kind of vacuum and then get it totally wrong for your market. If you're not careful, it's also easy to get your product totally wrong for your own company, which is what arguably happened to the group I was in.

Organizations counter this risk of specialization by having other job roles that are more focused on customer needs and less focused on the technology itself. From the perspective of someone sitting in an R&D lab, these other jobs represent the steps a product takes out the door toward doing useful work for a customer. A researcher discovers a new technology or technique, a developer turns it into a product, a marketer figures out how to promote it, a salesperson sells the product, and finally, a consultant integrates it into the customer's system. As work moves down this path, it gets progressively more applied and progressively less abstract. The reverse is true too: the further away from pure research you get, the closer you get to customer requirements. As much as a development lab has the responsibility to push product out to customers, customer-facing staff has the responsibility to push information on customer requirements back to the lab to drive product development priorities. If a developer isn't talking to a customer and a consultant is, then it's a safe bet that the consultant has a better idea of what a product needs to do to sell.

This is the reasoning that led me out of pure software development and into a consulting role at a different firm. In this new role, I was on projects developing configuration websites for computer resellers. If you envision Dell's online computer configuration tool, you're not far off from the kind of websites I was developing. While consultants at this company still did a bit of programming, the theory was that the heavy lifting of actually building the software was done in the company's R&D lab. Consultants were to focus on more basic customization and integration work. On my projects, most of my software development work was limited to customizing the layout of web pages and writing interfaces to databases and authentication providers. Interesting stuff, but not in the same technical league as what I was doing in my previous job.

The risks in this kind of internal consulting role are different from the risks in a pure development role. Unlike a developer sitting in an R&D lab, someone who might get to see a customer once a month or two at best, a consultant quite often is working on-site with the customer on a daily basis. In fact, it's easy for a consultant to see customer staff far more often than other employees of their own company. Of course, this potential isolation extends to the R&D lab and the sales group as well. In the worst case scenario, you end up with three independent silos in your organization: sales selling whatever they want, developers developing whatever they want, and the poor consultants caught in the middle, between an over-ambitious contract and an under-done base product. I shared an office with a guy working on a project that was sold as a one-month customization of one of our other base products. When I joined the firm, this project was in its 18th month of (mostly non-billable) consulting time. There was obviously a lot that had gone horribly wrong on that project, but foremost was a total disconnect between what the salespeople thought they had ready to sell and what the developers had actually produced to sell. (Not that the consultants were blameless on that project either: there were huge estimation and process problems in the customization work plan.)

I do not know of a role that is completely free of these types of risks, but my experience has tended to be that the difference between success and failure in any role is more related to communication with those around you than it is to technical skills. It is as much about giving your stakeholders what they want when they want it as it is anything else (including giving them something 'great'). This can be a difficult pill to swallow since it places emphasis on skills that do not come naturally to many developers. If you're a developer used to setting the development agenda, it's even worse, since it might involve ceding at least some of this power to people downstream and closer to customers. However, if you're really good, you will do whatever it takes (even if it's 'not your job') to know your customer's business well enough to anticipate what they need before they request it. Either way, success is ultimately about pressing your boundaries beyond your comfort zone to get what you need to do the thing you love in a way that satisfies those that care about your work.

July 2, 2007

I wrote this list of good software books a few years ago, in an effort to write down books that I've found to be interesting and/or useful. None of these books are of the 'X in Y days' variety, and very few qualify as reference books likely to be useful in your day-to-day work. That said, most of them will still make you a better developer.

Software Structure and Algorithms

  • Code Complete: A Practical Handbook of Software Construction - Steve McConnell - This book is a classic reference on how code should be constructed, when working at the level of program statements and functions. While it does not cover some of the newer programming constructs or languages, this book should be considered essential reading for any working programmer.
  • Programming Pearls, 2nd Edition - Jon Bentley - As the author puts it in the preface, this book talks about the more glamorous aspects of programming. Culled from a series of articles in Communications of the ACM, Bentley presents a series of insights on solving common programming problems.
  • The Art of Computer Programming, Vols. 1-3 - Donald Knuth - This is the series on the common algorithms used in software. Knuth presents, in gory detail, a series of algorithms and the mathematics on which they depend. You probably should not expect to read (or understand) everything that's here, but if you're in a bind, these books can be invaluable references. Robert Sedgewick also has an alternative that might be more accessible.
  • Compilers: Principles, Techniques, and Tools - Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman - The definitive reference on compiler design. If you have to do anything with expression parsing, analysis, or translation, this is a good introduction.

C and C++

  • C Traps and Pitfalls - Andrew Koenig - C is a language notoriously full of little oddities that can make programmers' lives more difficult. This book documents a lot of them, explaining a little about why the language works the way it does. Reading it is a good way to become a more seasoned user of C (and by implication C++). Expert C Programming by Peter van der Linden also appears quite good, although I've not read it all the way through.
  • The Design and Evolution of C++ - Bjarne Stroustrup - Ever wonder why C++ works the way it does? This book, documenting the history of C++ and C with Classes, talks a lot about the design decisions made while the language was being developed. It's a dry read at times, but understanding the intent of the designers of a tool can really help while using the tool.
  • Advanced C++: Programming Styles and Idioms - James Coplien - In the C++ world, this book is old: it predates the ANSI standard and many of the features introduced later on in the standardization effort (like the STL). Despite that, Coplien describes techniques that give C++ programs a level of flexibility approaching that of languages like Lisp and Smalltalk. He covers garbage collection, multi-method dispatch, and replacement of code at runtime, among other techniques. If you need those capabilities, I'd recommend another language. If you have to use C++, I'd recommend this book.

Win32 Programming

  • Win32 Programming - Brent Rector and Joseph Newcomer - This is one of the most detailed books on the core Win32 API I've ever seen. If you need to work with Win32, you need good documentation, and this is the book to get. Not only does it talk in great detail about API functions, it also talks about some of the differences between the various platforms extant in 1997. I can't recommend it enough, but I do hope it gets updated to include Windows 2000 and XP.
  • Windows++ - Paul DiLascia - First things first: this book is centered around Windows 3.1. While a lot has changed, a lot has stayed the same about the Windows API. Keeping that in mind, DiLascia's narrative through the process of encapsulating the Win16 API in C++ is a perfect example of how to manage the complexity of a difficult API using an object oriented programming environment. Even if you never write a Windows framework yourself, it's still good reading on the issues involved in both Windows programming and framework design in general.

Graphics and Visualization

  • Computer Graphics: Principles and Practice in C - Foley, van Dam, Feiner, Hughes - A good introduction to a wide variety of graphics algorithms and techniques. Serious work in the field will require more specialized books, however.
  • The Visual Display of Quantitative Information - Edward R. Tufte - A good, and beautiful, discussion on how to graphically depict numerical information, and do so reliably and clearly.
  • Practical Reusable Unix Software - Edited by Balachander Krishnamurthy - Not really a visualization or graphics book, except for the material covering the AT&T GraphViz (dotty and lefty) toolset. If you're programming with data structures more elaborate than lists, this tool can be an easy way to interpret data structures dumped to disk. With lefty, it's even possible to program specific user interfaces to edit graphical displays.

"Advanced" Languages

  • Shared Source CLI Essentials - David Stutz, Ted Neward, and Geoff Shilling - This book is dense, but essential reading. With the two biggest emergent platforms (Java and .Net) depending on advanced runtime support, these runtime engines are now a fact of life. While this book focuses on the shared source implementation of the .Net runtime, it is an excellent introduction to the ideas behind Java, .Net, and even languages like Smalltalk, Scheme, and Lisp.
  • The Art of the Metaobject Protocol - Gregor Kiczales, Jim des Rivieres, and Daniel G. Bobrow - This book documents the metaobject protocol found in the object-oriented part of some Common Lisp implementations. If you need to use metaobjects, it's of course a useful book, but it's also informative in general about the implementation of dynamic languages. I particularly liked the coverage of method dispatch and selection.
  • Common Lisp: The Language, 2nd Edition - Guy Steele - While it's been superseded by the ANSI HyperSpec as a reference for Common Lisp, this is still a good guide to the language. For programmers in general, and particularly language implementors, this book is also full of good ideas on algorithms and software design.
  • On Lisp - Paul Graham - Macros (code-rewrite rules) are one of the hardest things to understand in Common Lisp, and other similar languages. Graham goes into a lot of detail about how they can be used in real-world development, and includes a bunch of good examples. It is available free of charge online.
  • Structure and Interpretation of Computer Programs - Harold Abelson, Gerald Jay Sussman, and Julie Sussman - This is one of those books that has the potential to seriously change the way you think. Written as a textbook for an introductory programming course at MIT, this book talks a lot about varying styles of computation and how they can be used.