Mike Schaeffer's Blog

Articles with tag: tech
January 3, 2006

Despite outward appearances, there is (really!) another release of vCalc in the works. I'm not going to be so silly as to speak to a timeline (probably 2006Q2), but here's a brief list of planned features for the next release or two:

  • Constant Library - A library of a few hundred constants.
  • Interface improvements - The current UI is functional but rather plain, both in appearance and in the interactivity it supports. The next version of vCalc will dress up the UI a bit and start the process of making it more interactive.
  • Macro Recorder - To aid programming, there's a macro recorder that records sequences of commands as programs written in the user language described below.
  • New Data Types - There are more first-class data types, including complex numbers, lists, tagged numbers, and programs.
  • User Programmability - There's a user programming language that includes conditional branches, loops, and higher-order functions. This language looks a lot like a lexically scoped variant of RPL, the language used by HP in its more modern calculators.
  • Better interoperability with other data sources - This means import/export of CSV, both through the clipboard and by file.
  • Financial Math - This is mainly planned to be Time Value of Money. There's actually an interface in vCalc 1.0 to support this functionality, but I released vCalc before getting it to work reliably and disabled the code that implements it. This is going to be an ongoing area for development.
  • Infix notation - There needs to be a way to enter an expression like 'sin(x)'. This is both a programmability feature and the core of things like symbolic algebra and calculus.
  • Graphics - Function plotting.

In a more general sense, there are a few other issues that are important but have slightly lower priority. These are issues too big to be 'fixed' in one release, but they are nonetheless important areas for work. The first of these is performance and the second is openness.

Performance is the easier of the two issues to describe: I want vCalc to be usable for interactively performing simple analysis (mean, max, min, linear regression, histogram, etc.) of datasets with 100K-1,000K observations of 10-20 variables each. The worst case scenario means that vCalc needs to be able to manage an in-memory image around 500-600MB in size and perform 20-30M floating point operations within 5-10 seconds. That's a stretch for vCalc, but I think it's doable within a year or two. Right now, development copies of vCalc can reasonably manage 100K observations of 50 variables each. The biggest weakness is the CSV file importer, which is glacially slow: it reads CSV files at around 30K/second. I'll speak to these issues in more detail later on, but the fix will be a staged rewrite of the Lisp engine and garbage collector at the core of vCalc.
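
Back-of-envelope, that memory target works out like this: 1,000K observations × 20 variables × 8 bytes per double-precision value is roughly 160MB of raw data. Assuming each value, once boxed as a tagged Lisp object with allocation overhead, costs three to four times its raw size, the total lands right in that 500-600MB range.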

The other issue that will have to be fixed over time is openness. One of the things I'd like this blog to be is a way to communicate with the audience of vCalc users. That means example code, demonstrations of how to use vCalc to solve specific problems, and descriptions of the guts of vCalc, at the very least. For that to be useful, there needs to be an audience, and for there to be an audience, the vCalc bits need to be available for people to use and good enough that people care about using them. There's a lot to be done between here and there, but I've come to believe that open sourcing parts of vCalc and releasing more frequent development builds of the closed source parts will end up being key. We'll see.

December 14, 2005

I've been doing a lot of analysis of feeds and reports lately, and have come up with a couple of suggestions for file design that can make feeds easier to work with. None of this should be earth-shattering advice, but collectively it can mean the difference between a file that's easy to work with and a complete pain in the... well, you know.

  • Prefer machine readable formats - "Pretty printers" for reports have a lot of utility: they can make it easy for users to view and understand results. However, they also have disadvantages: it's harder to use "pretty" reports for the further downstream processing that someone will inevitably want to do. This is something that needs to be considered carefully, keeping your audience in mind, but if possible, pick a format that a machine can easily work with.
  • Use a standard file format - There are lots of standard formats available for reports and feeds: XML, CSV, Tab Delimited, S-Expression, INI File, etc. Use one of these. Tools already exist to process and manipulate these kinds of files, and one of these formats will be able to contain your data.
  • Prefer the simplest format that will work - The simpler the format, the easier it will be to parse and handle. CSV is a good example: XML is sexier and much more powerful, but CSV has been around forever and has many more tools. Consider XML support in Excel: Excel has gained XML support only in its most recent versions, but it has had CSV support since the beginning. Also, from a conceptual standpoint, anybody who can understand a spreadsheet can understand a tabular file, but hierarchical data is a considerably more complex concept. (In business settings, there's a very good chance your feed/report audience will be business analysts who know Excel backwards and forwards but have no formal computer science training.)
  • Prefer delimited formats to formats based on field widths - The trouble with columns based on field widths (column 1 is 10 characters wide, column 2 is 20, etc.) is that you have to remember and specify the field widths when you want to extract the tabular data. In the worst case, without the column widths you can't read your file at all. In the best case, it's just one more thing you have to do when you load a file.
  • If you specify column names, ensure they are unique. - This isn't necessary for a lot of data analysis tools, but some tools (cough... MS Access) get confused when importing a table with multiple columns of the same name.
  • Include a header that describes the feed. - To fully understand the contents of a file, you have to know what it contains and where it came from. This is useful both in testing (did this report come from build 28 or build 29?) and in production (when was this file generated?). My suggestions for header contents include the following (there's a short sketch of one way to lay this out after this list):
    • The version of the report specification
    • Name of the source application
    • Version of the source application (This version number should be updated with every build.)
    • Environment in which the source application was running to produce the report.
    • The date on which the report was run
    • If the report has an effective date, include it too.
  • Document your report - Without good, precise documentation of your file format, it'll be very hard to reliably consume files in that format. Similarly, have as many people as possible peer review your file format. Even if your system's code is complete garbage, the file format represents an interface to your system that may well live much longer than the system itself.
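
To make the header and column-name suggestions concrete, here's a minimal sketch in Python of one way to write such a feed. Everything in it (the '#'-prefixed header convention, the key names, the column names, the file name) is invented for the example, not a standard:

  import csv
  from datetime import date

  # All names and values below are invented for this example; a real
  # feed would pull them from the build system and runtime environment.
  header = [
      ("spec-version", "1.2"),
      ("source-application", "example-reporter"),
      ("source-version", "2.0.17"),   # bumped with every build
      ("environment", "production"),
      ("run-date", date.today().isoformat()),
  ]

  with open("feed.csv", "w", newline="") as out:
      # Provenance header: one comment line per item. Note that not
      # every CSV consumer skips '#' lines; some will need to strip
      # them before parsing.
      for key, value in header:
          out.write(f"# {key}: {value}\n")
      writer = csv.writer(out)
      # Unique, machine-friendly column names ahead of the data rows.
      writer.writerow(["account_id", "balance"])
      writer.writerow(["A-1001", "250.00"])
      writer.writerow(["A-1002", "975.50"])

Whether the header rides along as comment lines or lives in a separate sidecar file is a judgment call that depends on your consumers; the point is that the provenance travels with the data.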
November 16, 2005

A few months ago I wrote a bit on the impact of high resolution displays on the way Internet Explorer renders graphics. I had really planned on using the default setting. Not anymore!

This is the awful default: [screenshot missing]

This is as it should be: [screenshot missing]

Now, guess what Firefox does.

November 9, 2005

Thirty days hath September,
the rest I can't remember.
The calendar hangs on the wall;
Why bother me with this at all?

http://leapyearday.com/30Days.htm

Here's an Excel one-liner that computes the number of days in a particular month. Cell A2 contains the year of the month you're looking for, and cell B2 contains the month's ordinal (1=January, 2=February, etc.):

=DAY(DATE(A2,B2+1,1)-1)

This is mainly useful to illustrate what can be done with Excel's internal representation of dates. Dates and times in Windows versions of Excel are normally stored as a serial number of days starting from January 1st, 1900, which is day 1. You can see this by entering a date in a cell, and then reformatting the cell to display as a number rather than a date. For example, this reveals April 1st, 2004 to be represented internally as the number 38078: it is the 38,078th day of Excel's calendar. (Excel's count is actually two higher than the true number of days elapsed since January 1st, 1900, both because the count starts at 1 and because, for Lotus 1-2-3 compatibility, Excel's calendar includes a nonexistent February 29th, 1900.)
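
If you want to verify that arithmetic outside of Excel, here's a quick check in Python, used here just as an independent date calculator:

  from datetime import date

  # Days actually elapsed between the two dates on the real calendar.
  elapsed = (date(2004, 4, 1) - date(1900, 1, 1)).days
  print(elapsed)      # 38076

  # Excel's serial number: +1 because January 1st, 1900 is day 1,
  # and +1 more for the phantom February 29th, 1900.
  print(elapsed + 2)  # 38078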

The formula above relies on this in its computation of the number of days in a month. The sub-expression DATE(A2,B2+1,1) computes the date number for the first day of the month immediately following the month we're interested in. (For December, B2+1 is 13; DATE handles this by rolling over into January of the following year.) We then subtract one from that number, which gives us the date number for the last day of the month we are interested in. The call to DAY then returns that day's number within its month, which, for the last day of a month, is exactly the number of days in the month.
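
The same trick is easy to reproduce in Python if you want to sanity-check the formula; the function below is mine, not anything built in:

  from datetime import date, timedelta

  def days_in_month(year, month):
      # First day of the following month; December rolls over into
      # January of the next year, mirroring Excel's DATE behavior
      # with a month argument of 13.
      if month == 12:
          first_of_next = date(year + 1, 1, 1)
      else:
          first_of_next = date(year, month + 1, 1)
      # Step back one day to the last day of the requested month;
      # its day-of-month is the month's length.
      return (first_of_next - timedelta(days=1)).day

  print(days_in_month(2004, 2))  # 29 (2004 is a leap year)
  print(days_in_month(2005, 2))  # 28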
