Spaces within

“Unix doesn’t work properly with filenames with spaces”. This assertion from a coworker of mine prompted my harsh reply: “Unix works perfectly with spaces, it’s just programmer sloppiness that prevents it from properly handling blanks”. I was right (as always, I would humbly add) but my pal wasn’t completely wrong, at least when talking about the specific subpart of Unix named shell. Although it is far superior to the MS-DOS batch shell (that’s about the same command line you find in Windows), I bet it originated in more or less the same way – a core with functionality accreted over time in response to new needs or new opportunities.
The Unix shell (be it bash, ksh, zsh or fish) is nowadays a powerful programming tool allowing the programmer to craft rather complex artifacts. This scope is surely much broader than the one envisioned by its first developers, and the result is multiple ways to do the same thing, different ways to do similar things and cryptic ways to do simple things.
The conception of the Unix command line dates back nearly 40 years! Things were pretty different then, but I won’t bore you with details, just let your imagination run wild… likely it was even worse. Fish is a recent attempt to overcome most of the shell problems, but it is not as widespread as bash. As a professional put it some time ago: emacs may have tons of neat features, but you can be SURE you’ll always find vi on any Unix, while you can’t be certain you’ll have emacs. So better invest your learning time in vi.
Well, back to shell. What’s wrong with blanks? The main problem is that a space is a valid character in a file name and, at the same time, it is a separator for command line arguments. Back when every single byte could make a difference, it seemed right to make quotes optional when the filename doesn’t contain any spaces. So you can write:

$ ls foo/

to get the listing of directory foo, but you have to write:

$ ls "bar baz/"

if you want the listing of directory “bar baz” (or you could escape the space with a backslash). This can be tedious in interactive shells, but auto-completion usually takes care of it (type ‘ba’ then tab and the line gets completed with the available options, in this case: bar\ baz).
In shell scripts it goes from tedious to somewhere between annoying and irritating, because variables are not real variables like those you are used to in high-level languages, but behave more like convoluted macros. For example:

a="bar baz"
ls $a

is processed and interpreted as:

ls bar baz

As you see, the quotes disappear, because they are consumed by the assignment that puts bar+space+baz in the ‘a’ variable. Once ‘$a’ is expanded, the quotes are just forgotten memories: the result is split at the space and ls receives two separate arguments. In order to write proper shell scripts you have to do something like:

a="bar baz"
ls "$a"

Of course this is error prone: the unquoted form is valid syntax, the script is likely to work perfectly in the simple test case the programmer uses, and it is also likely to work fine most of the time. After all, the space character is used only by those naïve Windows users who aren’t aware of the blanks-hating device they keep hidden under their desks.
Well, I prided myself on writing space-safe shell scripts, at least until I tried to write a script to find duplicated files on a filesystem.
The goal is simple: after many virtual relocations, multiple pet projects and homework assignments, I have many files scattered around with the same content. It is not a matter of saving space, rather a question of order: avoiding redundant information, or making sure that it really is the same stuff.
My design was to have a command similar to ‘find’, something that accepts any number of directories or files on the command line, such as:
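To see how easily the bug hides, here is a small sketch. The helper count_args is a made-up name, not a standard command; all it does is report how many arguments it received.

```shell
#!/bin/bash
# count_args is a hypothetical helper: it prints the number of arguments received.
count_args() { echo "$#"; }

a="bar baz"
unquoted=$(count_args $a)      # the expansion is split at the space
quoted=$(count_args "$a")      # the expansion stays one word

b="foo"
simple=$(count_args $b)        # no space in the value, so the missing quotes go unnoticed

echo "unquoted: $unquoted, quoted: $quoted, simple test case: $simple"
# prints: unquoted: 2, quoted: 1, simple test case: 1
```

The last case is the treacherous one: with a space-free value the unquoted form behaves exactly like the quoted one, so a quick test tells you nothing.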

$ find_dupes dir1/ dir2/ file …

The shell offers two ways to handle this pattern – use the shift command or use one of the special variables $@ and $*.
The first way is useful if you want the shell to process one argument at a time, while the latter is handy when you want to relay the command line to another command. In my case I wanted to pass the entire command line to the ‘find’ command, something like:

$ find $@ -type f -exec md5sum {} \;

This line works fine until a filename with a space is encountered. In this case, since variables are indeed macros, a single argument containing a space is expanded into two (or more) distinct arguments. And there is no way to work around the limitation, unless you read the manual :-). In this area, the discoverability of bash is quite lacking. The man page states that "$*" (inside double quotes) expands to a single word made of all the arguments, separated by the first character of the IFS shell variable. E.g. if IFS is set to a dash (‘-’) and the command line has the arguments foo bar baz, then "$*" expands to foo-bar-baz.
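A minimal sketch of that behaviour, assuming bash. Both helper functions are hypothetical names introduced only for this illustration.

```shell
#!/bin/bash
# Hypothetical helpers, for illustration only.
count_args() { echo "$#"; }

join_with_dash() {
    local IFS=-                    # "$*" joins with the first character of IFS
    echo "$*"
}

joined=$(join_with_dash foo bar baz)
echo "$joined"                     # prints: foo-bar-baz

set -- "bar baz" qux               # two arguments, the first one contains a space
star_unquoted=$(count_args $*)     # split again at every space
star_quoted=$(count_args "$*")     # everything glued into a single word
echo "$star_unquoted $star_quoted" # prints: 3 1
```

Neither form gives back the original two arguments, which is why $* is of little use for relaying a command line.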
Conversely, $@ expands to a space-separated sequence of arguments, but if you enclose it in double quotes, each argument is expanded as its own quoted word. E.g. $@ expands to foo bar baz, and "$@" expands to "foo" "bar" "baz". In the end, this is the solution.
So, basically, it is true that Unix has no problem whatsoever with spaces inside filenames. It is also true that shell programming can handle them as well, and that ultimately it is down to programmer sloppiness if a script fails. But it has to be recognized that a great deal of effort and investment is required of the programmer to climb out of that sloppiness.
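Here is a sketch of how the script could look with the quoted form, assuming bash. The find call is the one from the text; the sort/uniq stage that reports the duplicates is my assumption and relies on the GNU versions of md5sum and uniq. count_args is a hypothetical helper used only to show the difference in argument count.

```shell
#!/bin/bash
# Hypothetical helper, for illustration only.
count_args() { echo "$#"; }

# Sketch of the script body. The sort/uniq stage is an assumption about how
# duplicates could be reported; it relies on GNU md5sum and GNU uniq.
find_dupes() {
    find "$@" -type f -exec md5sum {} \; |
        sort |
        uniq -w 32 -D      # keep lines whose first 32 characters (the MD5) repeat
}

set -- "bar baz" qux               # two arguments, the first one contains a space
at_unquoted=$(count_args $@)       # the space inside "bar baz" splits it
at_quoted=$(count_args "$@")       # each original argument survives intact
echo "$at_unquoted $at_quoted"     # prints: 3 2
```

With this in place, something like find_dupes "my docs" projects/ hands find exactly two starting points, blanks included.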

Secrets and Lies

Some days ago I helped a coworker with an oddly behaving Makefile. I am a long-time user of this tool and I am no longer surprised at ‘make’ doing the unexpected in many subtle ways. This time the problem was that a bunch of source files in a recursively invoked Makefile were compiled with the host C compiler rather than the cross-compiler as configured.
Make, in an attempt to ease the poor programmer’s life, pre-defines a set of dependencies with a corresponding set of re-make rules. One of these implicit rules states how to build an object file (.o) from a C source file (.c). The rule is somewhat like:

%.o: %.c
	$(CC) -c $(CPPFLAGS) $(CFLAGS) $<

And by default, the CC variable is set to ‘cc’, i.e. the default C compiler on Unix systems. Bear in mind that this is a recursively invoked make, therefore it is expected to be hidden at least one level away from the programmer. On the other hand, the build has configured the top-level make to use the cross compiler arm-linux-gcc. The problem is possible because ‘make’ has a local scope for variables, i.e. variables set in a makefile are not exported by default to the recursively invoked makefiles.
The hard part in spotting the problem is that everything works as expected, i.e. the build operation completes without a glitch and you are left wondering why your shared library is not loaded on the target system.
Once you know, the problem is easily fixed, but if you are an occasional Makefile user you may spend some bad hours figuring out what the heck is going on.
Hiding isn’t always bad – you need to hide details for abstraction, to treat complex objects as black boxes and simplify their handling. One of the three pillars of OOP is “encapsulation”, which basically translates to data opaqueness: the user of an object is not allowed to peek inside it.
The question that arises is – how much “hiding” is good and how much is bad?
The C compiler hides the nitty-gritty of assembly programming from the programmer, so that he/she can think about the problem with a higher-level set of primitives (variables instead of registers, structs instead of raw memory, and so on).
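The scoping issue, and one possible fix, can be reproduced with a throwaway directory. This is only a sketch, assuming GNU make is installed: the file names Makefile.plain and Makefile.export and the directory layout are invented for the demonstration, and exporting CC is just one way out (passing CC=... on the recursive command line is another).

```shell
#!/bin/bash
# Sketch, assuming GNU make. All file and directory names are made up.
unset CC MAKEFLAGS                 # start from a clean environment
demo=$(mktemp -d)
mkdir "$demo/sub"

# Sub-makefile: it only reports which compiler it would use.
printf 'all:\n\t@echo "sub-make CC=$(CC)"\n' > "$demo/sub/Makefile"

# Top-level makefile, variant 1: CC is set but not exported.
printf 'CC = arm-linux-gcc\nall:\n\t@$(MAKE) -s -C sub\n' > "$demo/Makefile.plain"

# Top-level makefile, variant 2: same, with an explicit export.
printf 'export CC = arm-linux-gcc\nall:\n\t@$(MAKE) -s -C sub\n' > "$demo/Makefile.export"

plain=$(make -s -C "$demo" -f Makefile.plain)
exported=$(make -s -C "$demo" -f Makefile.export)
echo "without export: $plain"      # the sub-make falls back to the built-in default
echo "with export:    $exported"   # the sub-make sees the cross compiler
rm -rf "$demo"
```

In the first variant the sub-make silently falls back to its built-in CC, which is exactly the build that “completes without a glitch”.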
If you want to go up with the abstraction level you must accept two things:

  • you are losing control of details;
  • something will happen under the hood, beyond your (immediate) knowledge.

Going up another level we meet the C++ language, with a good deal more going on below the horizon. For example, constructors implicitly call parent class constructors; destructors for objects instantiated as automatic variables (i.e. on the stack) are invoked when execution leaves the scope where the objects were instantiated.
If you are a bit fluent in C++ these implicit rules are unlikely to surprise or harm you. If you consider a traditional programming language such as C, Pascal or even Basic (!), you will notice quite a difference. In a traditional language you cannot define code that is executed without an explicit invocation. C++ (and Java, for that matter) is more powerful and expressive precisely because it hides the explicit invocation.
In many scripting languages (such as Python, Lua, Unix shell, PHP… I think the list could go on for very long) you don’t have to declare variables. Moreover, in most of them, if you use a variable that has not yet been assigned you get it initialized by default (Python is a notable exception and raises an error instead). The default is usually an empty string, a null value or zero, depending on the language. This could be considered handy, since the programmer saves a bunch of keystrokes and can concentrate on the core of the algorithm. I prefer to consider it harmful because it can hide one or more potential errors. Take the following pseudo-code as an example:

# the array a[] is filled somewhere with numbers.
while( a[index] != 0 )
{
    total += a[index];
    index++;
}
print total;

If uninitialized variables evaluate to the number 0, then the script will correctly print the sum of the array content. But what if some days later I add some code that uses a ‘total’ variable before that loop?
I will get a hard-to-spot error. Hard because the effect I see can be very far from its cause.
Another possible error comes from mistyping. If the last line were written as:

print tota1;

(where the last character of “tota1” is a one instead of a lowercase L)

I would get neither a parsing error nor an execution error, but the printed total would always be zero (or, with some variations in the code, it could be the last non-zero element of the a[] array). That’s evil.
I think that one of the worst implicit variable definitions is the one made in Rexx. By default Rexx variables are initialized to their own name in upper case. At least 0 or nil is a pretty recognizable default value.
Time to draw some conclusions. You can recognize a pattern – evil hiding aims to help the programmer save coding time, but doesn’t scale; good hiding removes details that would prevent the program from scaling up.
As you may have noticed lately, the world is not black and white: there are many shades, and compromises are like the Force – they have both a light side and a dark side. E.g. C++ exceptions offer an error-handling abstraction, at the cost of defensive programming nearly everywhere in order to avoid resource leaks or worse.
Knowing your tools and adopting a set of well-defined idioms (e.g. explicitly initializing variables, or using constructors/destructors according to the OOP tenets) are your best friends.
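Both failure modes can be reproduced in the Unix shell. The sketch below assumes bash; sum_args is a hypothetical helper that mirrors the pseudo-code, and the nounset option (‘set -u’) is shown as one way to make the shell complain about names that were never assigned.

```shell
#!/bin/bash
# Hypothetical helper mirroring the pseudo-code: it never initializes 'total'.
sum_args() {
    for n in "$@"; do
        total=$((total + n))     # bash arithmetic treats an unset 'total' as 0
    done
    echo "$total"
}

unset total tota1
first=$(sum_args 3 4 5)          # right answer, by luck
total=100                        # later addition: unrelated code reuses the name
second=$(sum_args 3 4 5)         # silently wrong
typo="[$tota1]"                  # the mistyped name expands to nothing, no error

# 'set -u' (nounset) turns the typo into a hard error; tried here in a subshell.
if (set -u; : "$tota1") 2>/dev/null; then caught=no; else caught=yes; fi

echo "first=$first second=$second typo=$typo caught=$caught"
# prints: first=12 second=112 typo=[] caught=yes
```

Initializing total explicitly inside the function would fix the first problem; nounset only helps with the second.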