Mike Schaeffer's Blog

Articles with tag: programming
March 20, 2006

If you ever manually work with Visual Studio 2003 projects (*.vcproj files), be aware that projects have both names and GUIDs, both of which are usually assigned by the IDE. If you try to duplicate a project in a solution by copying it to a new directory, renaming it, and adding the copy to your solution, the MSVC IDE will get confused, since you didn't give your new project a new GUID. Other things might be affected by this confusion too, but the inter-project dependency editor definitely can't distinguish between the two same-GUID projects. This ends up making it impossible to correctly sequence the two project builds, and there's no clue in the UI about the cause of the problem, since the dependency editor lists the two same-GUID projects under separate names.

I don't know if MSBuild, in VS2005, is any better, but Microsoft claims to have made it friendlier to non-IDE use cases. The strange thing about all this is that I'm not sure what purpose the GUIDs serve: I'd think that having multiple projects with the same name would create a host of other problems that the GUIDs wouldn't solve by themselves. Combine that with the outright user hostility of strings like the one below, and it's easy to wonder why the GUIDs are used at all.

{4051A65D-4718-41AE-8C94-6B1906EB4D77} = {4051A65D-4718-41AE-8C94-6B1906EB4D77}
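If you do need to duplicate a project by hand, the fix is to generate a fresh GUID (guidgen.exe, which ships with Visual Studio, works fine) and substitute it everywhere the copied project's GUID appears. It lives in at least two places; the project name, file names, and GUID below are illustrative:

<VisualStudioProject
    ProjectType="Visual C++"
    Version="7.10"
    Name="MyProjectCopy"
    ProjectGUID="{D3A9F1B2-5C42-4E8A-9F01-2B7C8D6E4A10}"
    ... >

The same new GUID then goes in the copy's Project line in the .sln file (the first GUID there identifies the project type, Visual C++, and stays the same):

Project("{8BC9CEB8-8B4A-11D0-8D11-00A0C91BC942}") = "MyProjectCopy", "MyProjectCopy.vcproj", "{D3A9F1B2-5C42-4E8A-9F01-2B7C8D6E4A10}"

With distinct GUIDs, the dependency editor can tell the two projects apart again.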
March 20, 2006

I recently converted a bunch of accessor macros in vCalc over to inline functions. These macros are a legacy of vCalc's SIOD heritage, and generally looked something like this:

#define FLONM(x) ((*x).storage_as.flonum.data)
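For context, the object representation underneath looks roughly like this. This is only a sketch modeled on SIOD's struct obj; the actual vCalc declarations differ in detail:

typedef double flonum_t;

struct LObject
{
   int type;                        /* tag: which arm of storage_as is live */
   union
   {
      struct { flonum_t data; } flonum;
      struct { LObject *car; LObject *cdr; } cons;
      /* ...one arm for each possible object representation... */
   } storage_as;
};

typedef LObject *LRef;   /* the pointer type the accessors take */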

In this macro, x is expected to be a pointer to an LObject, which, as sketched above, is basically a struct consisting of a union, storage_as, and a type field that specifies how to interpret the union. vCalc has a set of these macros to refer to the fields of each possible interpretation of the storage_as union. These little jewels remove a lot of redundant storage_as's from the code, and generally make the code much easier to work with. However, like all C/C++ macros, they have finicky syntax, bypass typechecking, and have limited ability to be extended to do other things (like verifying type fields). Fortunately, they are perfect candidates to be replaced by C++ inline functions and references:

inline flonum_t &FLONM(LRef x) 
{
   return ((*x).storage_as.flonum.data);
}

Even better, with function inlining turned on, there's no apparent performance penalty in making this transformation; even with inlining off, the penalty seems pretty modest, on the order of 20%. In other words, inline functions work exactly as advertised. They work well enough, in fact, that I made the 'logical extension' and added some type checking logic:

inline flonum_t &FLONM(LRef x)
{
   assert(FLONUMP(x));
   return ((*x).storage_as.flonum.data);
}

This version of the code verifies that x actually is a flonum before trying to interpret it as one. Normally, code using these accessor functions explicitly checks the type of an object before making a reference, but sometimes, thanks to coding errors, invalid references can slip through the cracks. With the old-style macros, an invalid reference could silently corrupt data. With the checks in place, there's at least an attempt to catch bad references before they're made.

Adding these checks proved profitable: they revealed three logic errors in about five minutes of testing, two related to reading complex numbers, and one related to macroexpansion of a specific class of form. Adding the type checks also killed performance, but that was pretty easy to solve by making them independently switchable.
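The switch can be as simple as a preprocessor flag. Here's a minimal sketch, assuming a hypothetical CHECKED_ACCESSORS build option (the flag and macro names are illustrative, not vCalc's actual ones):

#include <cassert>

#ifdef CHECKED_ACCESSORS
#  define ACCESSOR_ASSERT(pred) assert(pred)
#else
#  define ACCESSOR_ASSERT(pred) ((void)0)
#endif

inline flonum_t &FLONM(LRef x)
{
   ACCESSOR_ASSERT(FLONUMP(x));  /* compiles away when checks are off */
   return ((*x).storage_as.flonum.data);
}

Builds that define the flag get the assertions; everything else compiles the checks away entirely, so the fast path stays fast.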

January 20, 2006

"Look, the tech industry is and always will be fucked up. They still somehow manage to make a semi-usable product every once in a while. My Mac is slow as a dog, even though it has two CPUs and cost $5000, but I use it anyway because it's prettier and slightly more fun than the crap Microsoft and Dell ship. But give me a reason to switch, even a small one, and I'm outta here."

Dave Winer

If you don't know who Dave Winer is, he pioneered the development of outlining software back in the '80s, developed one of the first system scripting tools for the Macintosh, invented XML-RPC, and wrote the RSS specification you're probably using to read this blog post. I'm not trying to belittle the guy's point of view, but he's been responsible for several major pieces of consumer application software, designed a couple of hugely significant internet protocols, and made some significant money in the process. Most people should be so lucky.

December 14, 2005

I've been doing a lot of analysis of feeds and reports lately, and have come up with a few suggestions for file design that can make feeds easier to work with. None of this is earth-shattering advice, but collectively it can mean the difference between a file that's easy to work with and a complete pain in the ...well, you know.

  • Prefer machine readable formats - "Pretty printers" for reports have a lot of utility: they can make it easy for users to view and understand results. However, they also have disadvantages: it's harder to use "pretty" reports for the further downstream processing that someone will inevitably want to do. This is something that needs to be considered carefully, keeping your audience in mind, but if possible, pick a format that a machine can easily work with.
  • Use a standard file format - There are lots of standard formats available for reports and feeds: XML, CSV, tab-delimited, S-expressions, INI files, etc. Use one of these. Tools already exist to process and manipulate these kinds of files, and one of them will be able to contain your data.
  • Prefer the simplest format that will work - The simpler the format, the easier it will be to parse/handle. CSV is a good example of this: XML is sexier and much more powerful, but CSV has been around forever and has many more tools. A good example of what I mean is XML support in Excel. Excel has been getting XML support in the most recent versions, but it's had CSV support since the beginning. Also, from a conceptual standpoint, anybody who can understand a spreadsheet can understand a tabular file, but hierarchical data is considerably more complex a concept. (In business settings, there's a very good chance your feed/report audience will be business analysts that know Excel backwards and forwards but have no technical computer science training.)
  • Prefer delimited formats to formats based on field widths - The trouble with columns based on field widths (column 1 is 10 characters wide, column 2 is 20, etc.) is that you have to remember and specify those widths whenever you want to extract the tabular data. In the worst case, without the column widths you can't read your file at all. In the best case, it's just one more thing you have to do when you load the file.
  • If you specify column names, ensure they are unique. - This isn't necessary for a lot of data analysis tools, but some tools (cough... MS Access) get confused when importing a table with multiple columns of the same name.
  • Include a header that describes the feed - To fully interpret a file, you need to know what it contains and where it came from. This is useful both in testing (did this report come from build 28 or build 29?) and in production (when was this file generated?). My suggestions for header contents (see the sample feed after this list) include:
    • The version of the report specification
    • Name of the source application
    • Version of the source application (This version number should be updated with every build.)
    • Environment in which the source application was running to produce the report.
    • The date on which the report was run
    • If the report has an effective date, include it too.
  • Document your report - Without good, precise documentation of your file format, it will be very hard to reliably consume files in that format. Similarly, have as many people as possible peer review the format. Even if your system's code is complete garbage, the file format represents an interface to your system that may well live much longer than the system itself.
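To make this concrete, here's what a small feed following these suggestions might look like. The header keys, column names, and data are all made up for illustration, and since CSV has no standard comment syntax, the '#' prefix on the header lines is a convention the producer and consumers have to agree on:

# report-spec-version: 1.2
# source-application: TradeReporter
# source-version: 2.4.1308
# environment: production
# run-date: 2005-12-14
# effective-date: 2005-12-13
account_id,trade_date,symbol,quantity,price
10023,2005-12-13,IBM,500,84.13
10023,2005-12-13,MSFT,-200,27.91

A file like this loads directly into Excel or any CSV-aware tool, the column names are unique, and the header answers the provenance questions before anyone has to ask them.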