C Strings, a C++ View

The worst form of inequality is to try to make unequal things equal – at least according to Aristotle. On the other hand, software engineering tries quite hard to deal with unequal things in the same way. Think, for example, of the file system concept: it is very handy to deal with data on your hard disk in the same way you deal with data stored on a server across a network connection. As Joel suggests, pushing abstractions too hard may hurt, but I think it is hard to disagree that it is very convenient to produce video output regardless of the screen resolution or manufacturer.

So, wouldn’t it be nice to deal with strings regardless of whether they are null-terminated C strings or iterable C++ strings? Yes, of course, but what does the compiler think about this? Would it generate comparable, if not identical, machine code?

First let me jot down some code.

class CStringConstIterator
{
    public:
        explicit CStringConstIterator( char const* text ) noexcept : m_scan{text}
        {}

        CStringConstIterator() noexcept : m_scan{nullptr}
        {}

        // Pre-increment: advance to the next character and return a reference to this iterator.
        CStringConstIterator& operator++() noexcept { ++m_scan; return *this; }
        char operator*() const noexcept { return *m_scan; }

    private:
        friend bool operator==( CStringConstIterator const& lhs, CStringConstIterator const& rhs ) noexcept;
        char const* m_scan;
};

bool operator==( CStringConstIterator const& lhs, CStringConstIterator const& rhs ) noexcept
{
    // Equal when both point at the same character (or are both null), or when one side is
    // the null "end" iterator and the other has reached the terminating '\0'. Each clause
    // tests one side for nullptr before dereferencing the other; two null pointers are
    // already caught by the first comparison, so the dereference is always safe.
    return lhs.m_scan == rhs.m_scan ||
            (lhs.m_scan == nullptr && *rhs.m_scan == '\0') ||
            (rhs.m_scan == nullptr && *lhs.m_scan == '\0');
}

bool operator!=( CStringConstIterator const& lhs, CStringConstIterator const& rhs ) noexcept
{
    return !(lhs == rhs);
}

Quite straightforward: if you want the end iterator, you invoke the default constructor; otherwise you specify the pointer to the beginning of the string.
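
For example, here is a minimal usage sketch (the function name is mine, not from the original code): computing the length of a C string by walking from the begin iterator to the default-constructed end iterator.

#include <cstddef>  // std::size_t

// Hypothetical helper, assuming the CStringConstIterator class above.
std::size_t c_string_length( char const* text )
{
    std::size_t count = 0;
    auto scan = CStringConstIterator{ text };   // begin: points at the first character
    auto end  = CStringConstIterator{};         // end: default-constructed, holds nullptr
    while( scan != end )
    {
        ++count;
        ++scan;
    }
    return count;
}

Note that the empty string works out of the box: its begin iterator points at the terminating character and therefore already compares equal to end.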

Internally the iterator is represented by a pointer, which is conveniently set to nullptr for the end iterator. I gave this a bit of thought, since the end of a string and a null string pointer are quite different things, and I am afraid that conflating them could lead to some crookedness, like NIL and the empty list in Lisp.

In this case, I think it is quite safe – if you build an iterator from a null pointer, then that iterator is promptly identified as an end iterator and you avoid any chance of dereferencing it.
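
To make this concrete, here is a tiny check (mine, not part of the original post) of what the equality rules above imply: both an iterator built from a null pointer and an iterator sitting on the terminating character compare equal to the default-constructed end iterator.

#include <cassert>

void check_end_equivalence()
{
    // An iterator built from a null pointer behaves as an end iterator...
    assert( CStringConstIterator{ nullptr } == CStringConstIterator{} );
    // ...and so does an iterator that has reached the terminating '\0'.
    assert( CStringConstIterator{ "" } == CStringConstIterator{} );
    // A non-empty string has not reached its end yet.
    assert( CStringConstIterator{ "a" } != CStringConstIterator{} );
}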

At first glance, the C++ code may seem to carry quite a burden when compared to the plain C version, which uses nothing but pointer arithmetic and no iterators. So I decided to peek under the hood and have a look at the generated machine code.

Here are the two functions to compare:

#include <cstdio>   // for std::putchar

void f1( char const* text )
{
    if( text == nullptr ) return;
    while( *text != '\0' )
    {
        std::putchar( *text );
        ++text;
    }
}

void f2( char const* text )
{
    if( text == nullptr ) return;

    auto scan = CStringConstIterator{ text };
    auto end = CStringConstIterator{};

    while( scan != end )
    {
        std::putchar( *scan );
        ++scan;
    }
}

f1 is the traditional C string loop, while f2 is the idiomatic C++ container iteration.

Thanks to Compiler Explorer, this activity is quite easy:

GCC 10 (x86): When compiling with size optimization, the resulting code is the same. If you enable speed optimization (-O6), the two resulting codes are slightly different; it is hard to tell which one is faster.

GCC 9.2.1 (ARM): Using Thumb code generation and size optimization, the two codes are the same. Speed optimization yields minimal differences.

Clang (ARM and x86): Clang does an excellent job – regardless of optimization mode (-O6 vs. -Os) or instruction set (-mthumb), the resulting assembly is always the same.

MSVC 19.27 (x86): This is not the compiler I use in my daily job, and it has been quite a while since I last used it. In order to get comparable code, you need to specify /Ox (and /Og may improve things). Surprisingly, the instruction count is lower in the f2 (iterator) function.

Time to draw some conclusions. In ancient times, when Real Men wrote Real Programs in C with nothing more than a bunch of punched cards, they could pretty well understand what the resulting assembly would look like with a quick glance at the source code (possibly by looking at the holes in the punched cards).

Long gone are those times: our machines are ludicrously faster and our compilers are outlandishly better at their job of translating human (hopefully) readable source code into optimized assembly. Old-time intuition no longer works, as shown by the example above.

Forecasting how much machine code will be generated is harder today. You have to dig quite a bit into the implementation (or thoroughly read the documentation, when available) to get the cost, in terms of size and performance penalties, of the source code you are looking at. And eventually you have to look at the machine code to confirm your expectations.

For these reasons, today it is even truer that you should focus on making the program right well before making it fast. You still need to be sure about the computational complexity of your solution, but small optimizations may be totally misleading or unnecessary.

Back on Earth: hey, you can handle zero-terminated C strings in the same way you handle C++ strings.
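
If you want to push the analogy one step further, a thin range wrapper (my own sketch, not part of the original post) is enough to make a null-terminated string usable in a range-based for loop, since all that requires is begin() and end():

// Hypothetical wrapper around a null-terminated string; not from the original post.
class CStringRange
{
    public:
        explicit CStringRange( char const* text ) noexcept : m_text{text}
        {}

        CStringConstIterator begin() const noexcept { return CStringConstIterator{ m_text }; }
        CStringConstIterator end() const noexcept { return CStringConstIterator{}; }

    private:
        char const* m_text;
};

void f3( char const* text )
{
    if( text == nullptr ) return;

    // Same traversal as f1 and f2, expressed as a range-based for loop.
    for( char c : CStringRange{ text } )
    {
        std::putchar( c );
    }
}

I have not compared the generated code for this variant, but since the loop desugars to exactly the iterator pattern used in f2, I would expect the compilers to treat it the same way.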
