Mike Schaeffer's Blog

January 5, 2006

My alma mater, The University of Texas at Austin, won the national NCAA football championship last night. After 35 years, balance has been restored to the world...

I just wish I were in Austin tonight; the Tower will be in full regalia. Barring that, a live Tower cam will have to do.

January 3, 2006

Despite outward appearances, there is (really!) another release of vCalc in the works. I'm not going to be so silly as to speak to a timeline (probably 2006Q2), but here's a brief list of planned features for the next release or two:

  • Constant Library - A library of a few hundred constants.
  • Interface improvements - The current UI is functional but rather plain, both in appearance and in the interactivity it supports. The next version of vCalc will dress up the UI a bit and start the process of making it more interactive.
  • Macro Recorder - To aid programming, there's a macro recorder that records sequences of commands as programs written in the language described below.
  • New Data Types - There are more first-class data types, including complex numbers, lists, tagged numbers, and programs.
  • User Programmability - There's a user programming language including conditional branches, loops, and higher-order functions. This language looks a lot like a lexically scoped variant of RPL, the language HP used in its more modern calculators. (A rough illustration of the stack-based style follows this list.)
  • Better interoperability with other data sources - This means import/export of CSV, both through the clipboard and by file.
  • Financial Math - This is mainly planned to be Time Value of Money. There's actually an interface in vCalc 1.0 to support this functionality, but I released vCalc before getting it to work reliably and disabled the code that implements it. This is going to be an ongoing area for development.
  • Infix notation - There needs to be a way to enter an expression like 'sin(x)'. This is both a programmability feature and the core of things like symbolic algebra and calculus.
  • Graphics - Function plotting.
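
Since the language itself hasn't shipped yet, here's a rough sketch in Python (not vCalc's actual syntax or engine, purely an illustration) of the stack-oriented model that RPL-style languages share: commands pop their arguments off a stack and push results back, and programs are themselves values that higher-order commands can apply.

    import math

    def run(program, stack):
        """Apply each step of a program to the stack; steps are values or callables."""
        for step in program:
            if callable(step):
                step(stack)
            else:
                stack.append(step)
        return stack

    # A few stack "commands" -- each pops its arguments and pushes its result.
    def add(stack):
        b, a = stack.pop(), stack.pop()
        stack.append(a + b)

    def sin(stack):
        stack.append(math.sin(stack.pop()))

    def map_over(stack):
        """Higher-order command: pop a program and a list, apply the program to each element."""
        prog, xs = stack.pop(), stack.pop()
        stack.append([run(prog, [x])[-1] for x in xs])

    print(run([2, 3, add, sin], []))                       # "2 3 + sin" -> [sin(5)]
    print(run([[0.0, math.pi / 2], [sin], map_over], []))  # map sin over a list -> [[0.0, 1.0]]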

In a more general sense, there are a few other issues that are important but sit at a slightly lower priority. They are too big to be 'fixed' in a single release, but they are nonetheless important areas for work. The first is performance and the second is openness.

Performance is the easier of the two issues to describe: I want vCalc to be usable for interactively performing simple analyses (mean, max, min, linear regression, histogram, etc.) of datasets with 100K-1,000K observations of 10-20 variables each. The worst case means that vCalc needs to be able to manage an in-memory image around 500-600MB in size and compute 20-30M floating point operations within 5-10 seconds. That's a stretch for vCalc, but I think it's doable within a year or two. Right now, development copies of vCalc can reasonably manage 100K observations of 50 variables each. The biggest weakness is the CSV file importer, which is glacially slow: it reads CSV files at around 30K/second. I'll speak to these issues in more detail later on, but the fix will be a staged rewrite of the Lisp engine and garbage collector at the core of vCalc.
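
To put some rough numbers behind that target, here's a quick back-of-envelope sketch; the per-cell overhead below is an assumption about a boxed Lisp representation of the data, not a measured figure from vCalc.

    # Back-of-envelope sizing for the worst case: 1,000K observations of 20 variables.
    observations = 1_000_000
    variables = 20
    bytes_per_cell = 8 + 24  # assumed: an 8-byte double plus ~24 bytes of boxing/GC overhead

    image_mb = observations * variables * bytes_per_cell / 2**20
    print(f"estimated in-memory image: {image_mb:.0f} MB")  # ~610 MB

    # A simple statistic (mean, min/max, a regression pass) touches each cell
    # once or twice, so the floating-point work is on the order of:
    flops = observations * variables * 1.5
    print(f"estimated floating point ops: {flops / 1e6:.0f}M")  # ~30M

Under those assumptions, the worst case lands right around the figures above.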

The other issue that will have to be fixed over time is openness. One of the things I'd like this blog to be is a way to communicate with the audience of vCalc users. That means example code, demonstrations of how to use vCalc to solve specific problems, and descriptions of the guts of vCalc, at the very least. For that to be useful, there needs to be an audience, and for there to be an audience, the vCalc bits need to be available for people to use and good enough for them to care about using. There's a lot to be done between here and there, but I've come to believe that open sourcing parts of vCalc and releasing more frequent development builds of the closed-source parts will end up being key. We'll see.

December 14, 2005

I've been doing a lot of analysis of feeds and reports lately, and I've come up with a couple of suggestions for file design that can make feeds easier to work with. None of this should be earth-shattering advice, but collectively it can mean the difference between a file that's easy to work with and a complete pain in the ...well, you know.

  • Prefer machine-readable formats - "Pretty printers" for reports have a lot of utility: they can make it easy for users to view and understand results. However, they also have disadvantages: it's harder to use "pretty" reports for the further downstream processing that someone will inevitably want to do. This is something that needs to be considered carefully, keeping your audience in mind, but if possible, pick a format that a machine can easily work with.
  • Use a standard file format - There are lots of standard formats available for reports and feeds: XML, CSV, Tab Delimited, S-Expression, INI File, etc. Use one of these. Tools already exist to process and manipulate these kinds of files, and one of these formats will be able to contain your data.
  • Prefer the simplest format that will work - The simpler the format, the easier it will be to parse and handle. CSV is a good example of this: XML is sexier and much more powerful, but CSV has been around forever and has many more tools. Consider XML support in Excel: Excel has only gained XML support in its most recent versions, but it has had CSV support since the beginning. Also, from a conceptual standpoint, anybody who can understand a spreadsheet can understand a tabular file, but hierarchical data is a considerably more complex concept. (In business settings, there's a very good chance your feed/report audience will be business analysts who know Excel backwards and forwards but have no formal computer science training.)
  • Prefer delimited formats to formats based on field widths - The problem with columns based on field widths (column 1 is 10 characters wide, column 2 is 20, etc.) is that you have to remember and specify the field widths when you want to extract the tabular data. In the worst case, without the column widths you can't read your file at all. In the best case, it's just one more thing you have to do when you load the file.
  • If you specify column names, ensure they are unique - This doesn't matter to a lot of data analysis tools, but some tools (cough... MS Access) get confused when importing a table with multiple columns of the same name.
  • Include a header that describes the feed - To fully understand a file, you really have to know what it contains and where it came from. This is useful both in testing (did this report come from build 28 or build 29?) and in production (when was this file generated?). A hypothetical example appears after this list. My suggestions for header contents include:
    • The version of the report specification
    • Name of the source application
    • Version of the source application (This version number should be updated with every build.)
    • Environment in which the source application was running to produce the report.
    • The date on which the report was run
    • If the report has an effective date, include it too.
  • Document your report - Without good, precise documentation of your file format, it'll be very hard to reliably consume files in that format. Similarly, have as many people as possible peer review your file format. Even if your system's code is complete garbage, the file format represents an interface to your system that will possibly live much longer than the system itself.
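
As a concrete illustration of several of these points at once, here's what a small CSV feed with a descriptive header block might look like, along with a Python sketch that reads it and checks for duplicate column names. The '#'-prefixed header convention and the field names are hypothetical, chosen only for this example.

    import csv
    import io

    # A hypothetical feed: '#' lines carry the header metadata described above.
    SAMPLE_FEED = "\n".join([
        "# report-spec-version: 1.2",
        "# source-application: billing-extract",
        "# source-version: 2.0.128",
        "# environment: production",
        "# run-date: 2005-12-14",
        "# effective-date: 2005-11-30",
        "account_id,region,balance",
        "1001,US,250.00",
        "1002,EU,980.50",
    ])

    def read_feed(text):
        """Split a feed into header metadata and data rows, checking column names."""
        header, data_lines = {}, []
        for line in text.splitlines():
            if line.startswith("#"):
                key, _, value = line.lstrip("# ").partition(":")
                header[key.strip()] = value.strip()
            else:
                data_lines.append(line)
        rows = list(csv.reader(io.StringIO("\n".join(data_lines))))
        columns, records = rows[0], rows[1:]
        if len(set(columns)) != len(columns):
            raise ValueError("duplicate column names: " + ", ".join(columns))
        return header, columns, records

    header, columns, records = read_feed(SAMPLE_FEED)
    print(header["source-application"], header["run-date"])  # billing-extract 2005-12-14
    print(columns, len(records), "rows")                      # ['account_id', 'region', 'balance'] 2 rows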

November 16, 2005

A few months ago I wrote a bit about the impact of high-resolution displays on the way Internet Explorer renders graphics. I had really planned on using the default setting. Not anymore!

This is the awful default:

This is as it should be:

Now, guess what Firefox does.

Older Articles...