A Type with Character

So, you know everything about C++ integral types, don’t you?

I thought I did, until when I enabled clang-tidy on my project. It all started with the rather innocent-looking warning:

warning: use of a signed integer operand with a binary bitwise operator

It looked somewhat less innocent when, examining the line, I saw no evidence of signs.

But, let’s start from the comfort zone (at least, my comfort zone):

unsigned x = 32;

The type of ~x (bitwise negation) is still unsigned. No surprise here, really obvious. The diligent programmer finds that the data fits in a smaller integer and writes:

   uint8_t s = 42;

Can you guess the type of ~s? Give it a try, really. Ready? Well, the type of ~s is… int. What? A quick check of other expressions involving uint8_t yields the same … fascinating result. Apparently these expressions are all converted into int.

In other words (and with a bit of syntax bending) uint8_t+uint8_t -> int, uint8_t<<uint8_t -> int, uint8_t+1 -> int. Let me rephrase that, in every expression an uint8_t type is converted to int.

Time for some Internet Duckduckgoing :-).

Back to our uint8_t (that is nothing but an unsigned char in disguise). When a char value, be it signed, unsigned or plain is encountered in an expression (in C++ standard jargon, a prvalue) it is promoted to int pretty much on every common CPU. On exotic architectures, char and int could have the same size, so int could not hold every possible value of char and therefore it is turned into an unsigned. From a strictly rigorous point of view, the signedness of the type of uint8_t in expression is machine-dependent. Keep this in mind if you aim to write really portable code 😉 (*)

You can find a very good (and somewhat frightening) explanation here.

But I’d like to add something beyond the standard technicalities and look at the reasons for why this is like this and what we can take home.

First, it is important to fix that in C (and C++ by extension) the int/unsigned type is mapped to the most efficient word of the target processor. The language provides that, as long as you use int/unsigned (without mixing them) in an expression, you get the most performing code.

Also, the language mandates that an int be at least 16 bits wide and at most the same size of a long.

What if you need to do math on 8bits data on a 32bits architecture? Well, you need assembly code to insert mask operation for cutting away the excess data in order to reach the right result.

So, the C language opts for the best performance turning everything into int, avoiding the extra assembly for masking and let the programmer arrange the expressions with cast here and there if anything different is desired.

Unexpected type promotion, sign change, and performance penalty should be three good reasons to avoid using anything different from int and unsigned in expressions (or long long when needed) and keep intXX_t and uintXX_t for storage.

Note that this applies also to function arguments. It happens quite frequently to read APIs where types for integer arguments are defined as the smallest integer capable of holding the largest value for a given parameter. That may seem a good idea at first since the API embeds in the interface the suggestion to the user for a proper range.

In fact, this has to be balanced against the aforementioned problems and don’t really enforce the constraints for two reasons – first, you can actually pass any integral type and get not even a warning and second, possibly your accepted range is a subset of all the possible value of the type, therefore the user is still required to read the documentation.

Finally, when in doubt, use the compiler 🙂 Finding types of expressions via the compiler could be not the most intuitive task. Below you find the code I used for my tests. Happy type-checking!

/* Linux only, sorry */
#include <iostream>
#include <cstdint>
#include <cxxabi.h>

int main()
{
    uint8_t a = 42;
    std::cout << abi::__cxa_demangle(typeid( a ).name(), nullptr, nullptr, nullptr) << "\n";
    std::cout << abi::__cxa_demangle(typeid( ~a ).name(), nullptr, nullptr, nullptr) << "\n";
    std::cout << abi::__cxa_demangle(typeid( 42u ).name(), nullptr, nullptr, nullptr) << "\n";
    std::cout << abi::__cxa_demangle(typeid( ~42u ).name(), nullptr, nullptr, nullptr) << "\n";
    return 0;
}

(*) Also keep in mind, if you strive to write a portable code, that uintXX_t types may not exist for exotic architectures. In fact, the standard mandates that if the target CPU has no 8 bits type then uint8_t be not defined for such target. You should use uint_least8_t (meaning a data type that has at least 8 bits) and uint_fast8_t (a data type that is the most efficient for operations with 8 bits).

it++ 2019

In Italy, June is usually that period of the year when the weather becomes hot and the Italian C++ conference is held in Milan.

That’s why (the conference, not the weather) on Saturday 15th June, my friend and coworker Francesco and I went to the Milan Polytechnic. As for the other conferences I attended, I wrote quite a large volume of notes. Regretfully, as it happened for the other conferences, reorganizing my notes to make some sense of them and present them in an understandable and useful way is out of reach for my homeopathic level of free time. Also, usually you can find online the slides and videos of the talks well before my post. Therefore I decided for the short format I already used a couple of times in the past – a brief overview describing the conference for what you can’t see in the video e a brief summary of the talks I attended to.

Read moreit++ 2019

++it18 – Channels are Useful, not only for Water

High performance computing made quite some progress lately. Maybe you heard about the reactive manifesto (you should if you are a reader of this blog ;-)), if not, it is an interesting readying … if nothing else as a conversational icebreaker.

Basically the manifesto preaches for systems more responsive and reliable by reacting to requests and dusting it into smaller requests processed by smaller and autonomous systems. This is nothing new, but quite far from traditional processing where single monolitic application took care of requests from reception to response.

There are several programming patterns useful for implementing a reactive system, channels (and streams which are quite the same thing) are one of them. A channel allows you to define a data flow with inputs, computation units, merge, split and collector of information coming in sequence into your system. Once the dataflow is defined, the data flows into it and results come out from the other end.

In this talk Felix Petriconi describes his implementation of channels for C++.

Read more++it18 – Channels are Useful, not only for Water

Protecting you from danger

One of the interesting features of Scala is the trait construct. In Java you have interfaces, a stripped down class that lets you express the requirements for classes that implement it to have a given … interface. Just to keep this from an OO perspective – in a certain system all MovableObjects, must expose an interface to retrieve, say, position, speed and acceleration:

interface MovableObject {
    Point getPosition();
    Vector getSpeed();
    Vector getAcceleration();
}

class Rocket implements MovableObject {
    public Point getPosition() {
        return m_position;
    }

    public Vector getSpeed() {
        return m_speed;
    }

    public Vector getAcceleration() {
        return m_acceleration;
    }
    private Point m_position;
    private Vector m_speed;
    private Vector m_acceleration;
}

In Java the interface has no implementation part, just a list of methods that heir(s) must implement. This is a way to mitigate the lack of multiple inheritance in the Java language – a class can inherit at most from a single base class, but can implement as many interfaces as needed.

Scala traits are different in the sense they can provide an implementation part and fields. Now that I read the wikipedia definition, it looks like that traits are a bit more mathematically sound than interfaces (that’s what you can expect in functional language). Anyway, in plain English, traits are very close to C++ base classes (with some additions that is stuff for another blog post) in the sense that you can both require interface and provide common functionality.

Read moreProtecting you from danger

++it18 – Zero Allocation and No Type Erasure Futures

This talk by Vittorio Romeo tackles a subject that could go ignored if your programming style doesn’t feel the need for parallelism (or if it does, pthreads satisfies such a need). Futures have been introduced in C++ with 2011 ISO standard, they hardly can be deemed as novelty.

I’ve been hit by futures some 4 years ago when confronted with a large code base in Scala that exploited futures and actors. So, at least to save my ego, there is nothing wrong if you don’t know what a future is :-).

In brief a future is a way to model a concurrent operation. The future can be either in execution or succeeded with a result value, or failed with an error. As user, you request an async operation and you get a future of the expected result. You can set callbacks on the completion, or combine several futures together either by chaining or by synchronize them. As you may imagine futures vastly simplify the handling of async computation by hiding the synchronization and communication details and by providing a common and clear usage pattern.

So, no excuse, if you need to do async programming, you’d better know this tool.

Read more++it18 – Zero Allocation and No Type Erasure Futures

++it18 – Time Travel Debug

This talk presented a very interesting object – the debugger the could go backward.  But I don’t want to steal the scene, since Paolo Severini from Microsoft explained everything quite well in his talk. Here you can find the slides. Nedless to say errors are mine.


Everyone knows that debugging is twice as hard as writing the code in the first place. So if you’re as clever as you can be when you write it, how long will it take to debug it?

Debugging could be both hard and time consuming. Bugs can be difficult to replicate, especially those involving memory corruption, race conditions, or those that happen randomly or intermittently.

More often than not, it can be hard to spot the cause from logs or crash dumps. Also step by step debugging may require several restarts.

When doing step-by-step debugging is quite common something like:

step into, step over, step over, step over, step ov.. @!# I should have stepped into!.

The more information we can get from debugging tools the better. A tool that could tell “what just happened” would be very very useful.

Reverse debugging is such a tool. The idea is to record state by state everything happens in the CPU that’s executing the code and then the reverse debugger allows the programmer to play back and forth to spot the problem.

First ideas of such a tool dates back to 70s [NdM. I keep noticing that most of the computer science has been invented back in the 60-70s years. They just lacked the CPU resource for practical applications].

There are several tools for reverse debugging native code:

In this speech I will present TTD.

TTD is the reverse debugger for Windows. It is fully integrated with WinDbg. TTD can be used for all user-mode processes, for multithread and multicore

In order to use TTD, you need to set WinDbg to record data. This option needs to be explicitly enabled since has two main drawbacks – first all the program execution is slowed down and then quite a large amount of data is produced. You can expect anywhere between 5 and 50 Mb/s.

After recording has started, debugging works as expected, plus you can step back and forth and also continue backward or forward.

Take note that TTD works only in the latest versions fo Windows 10.

Interestingly you can perform queries on data collected during execution. Every line in the database matches an assembly instruction executed. So you can query how many times some instruction has been executed, how many exceptions have been thrown and so on.

[NdM: A working demo follows, unfortunately the time is quite constrained. A sample program to recursively compute the byte size of file in a directory tree is presented. The program just crashes when executed. So it is executed in WinDbg with TTD enabled. The execution halts as usual in the middle of exceptionland where nothing can be understood unless you speak advanced assembly. By pressing back a couple of time the debugger, going backwards, reaches the last function executed, the one who crashed. At this point Paolo places a breakpoint at the beginning of the function and – with a good implementation of the WOW effect – runs the program backwards. The execution back-reaches the function beginning and can be step forward to better understand why it crashed]


The presentation has been very interesting to show the tool capabilities. I tried an evaluation of UndoDB back in 2006, but it was as uncomfortable as it could be since it had to be used from command line. WinDbg+TTD is light years ahead in terms of usability.

Also I found very interesting the data query feature. Having the db of the program execution states allows the programmer to get a number of important statistics and events without the need of adding instrumentation or specific breakpoints.

The drawback is the execution time, but I guess that without a specific hardware support this is the price we have to pay still for some time in the future.

++it 2018 – Coroutines – coming soon… ?

I really enjoyed this talk and not only because the speaker is an old friend on mine. I met Alberto back in the Ubisoft days and he was and still is one of the best C++ programmers I know and for sure the only friend who submitted a proposal to the standard committee that got approved. On the other hand I have no friends who submitted a proposal to the standard committee that was refused. So all my friends who submitted… well I’m digressing.

This talk was in Italian, that is welcome, but… weird. Usually all talks are in English even for Italian conference. Anyway here’s my summary, as usual errors and mistakes are mine. You can find Alberto’s slides here.


Coming Soon to a Compiler Near You – Coroutines

Alberto is a C++ enthusiast programmer since 1990. In this hour he’s going to talk about coroutines, their history and their relation with C++ and C++ Technical Specifications (TS).

A suspend and resume point is a point in the instruction sequence where the execution can be suspended, other stuff is executed and then the execution is resumed from there.

The most common form of suspend and resume point is a subroutine call. The execution is transferred from the caller to the callee and when the callee is done, the execution is transferred back to the caller.

Coroutines are a generalized form of suspension and resume points, that allow you to manage where you want to suspend, transfer control and resume, without be constrained in the caller/callee format.

When is this pattern useful?

Read more++it 2018 – Coroutines – coming soon… ?

++it 2018 – Key Notes

Saturday 23 June 2018, was a sunny day in Milan. The ideal day to close yourself in Milano Bicocca University with many equally nerdy fellow programmers to attend the Italian C++ conference.

I discovered this conference, now in its 3rd edition, by chance, skimming through my neverending queue of unread tweets. A friend of mine and C++ guru twitted about this, so I bookmarked the website and as soon as this year edition was announced I was one of the first to register.

Also it is worth noting that this conference is free (as in beer), lasts a full day and includes the lunch. Talks are interesting and the conference is organized in two tracks with a single starting keynote and a single closing speech.

Although free, there are famous guests – last year Bartosz Milewski – quite a famous functional programmer and category algebra expert – held a speech about C++ and functional programming; this year the keynote was by Peter Sommerlad — C++ expert and member of the C++ ISO committee.

As usual my good intentions are to write and publish about each single talk I attended to.

Comparison with other conferences I attended, such as Scala Days, Scala Italy and Lambda World, is natural, although not always fair – ++it is completely free, while others come to a price (not light in case of Scala Days). I couldn’t fail to notice that age on average was older here than in the functional world. I won’t try an interpretation, but there could be more than one.

But, enough with babbling, let’s start with the keynote (based on my notes taken during the talk, errors or misunderstandings are just mine). You can find Peter Somerlad’s slides here. My comments at the end.


A – B – C : Agile, BMW, C++

In this talk prof. Somerlad basically tells about his professional career, era by era, providing a short list of lessons learned for each period.

He started programming on a HP pocket calculator. This device had some 50 steps of program instruction and no graphics whatsoever. Also TI calculators were more widespread at time, so he had to translate programs from one programming language to another one. Although very similar (they were all stack based) each calculator had some peculiarities.

One program he developed for fun was a ballistic simulation – define a force and a direction to fire a bullet that has to avoid a wall and hit a target. As said, the pocket gizmo has no graphics, but years later, the very same idea was re-instantiated with some graphics and sounds to become… Angry Birds.

Lessons learned:

  • Mentally execute code – there was no debugger;
  • Using the stack;
  • Reading foreign code (to translate from TI to HP);
  • Porting code;
  • Old simple ideas reamain useful.

The first home computer was a ZX Spectrum – an 8bits computer with rubbery keyboard that connected to a TV set and used a cassette recorder as mass storage. Each key was decorated with several keyword beside the standard letter and symbol you normally find on a key. When pressing a key, depending on the context, one of such a keyword appeared on the screen – automatically typed for you. Context was determined by pressing two shift keys. The correct combo produced the required effect.

At school he worked on a mainframe, that had a curious assembly language. Notable one instruction, by exploiting an indirection bit, could be used to create an infinite loop. He and another one student tried the instruction and the mainframe froze. After few seconds an angry System Administrator entered the room, asking them to quit what they were doing.

Lessons learned:

  • Basic, Z80 assembly and machine code, Pascal;
  • Keyboard shortcuts and meta layout;
  • Patience (loading from tape takes forever);
  • Dealing with errors from hardware.

In 1985 he landed his first job where he had to write a kind of customer management application on an Apple IIe. Now the software had to be professional and one requirement was that a cat could walk on the keyboard without causing a crash of the application. At times no commoditized databases were available and the few db engines available were very expensive. Since the Apple IIe had floppy discs, they may have to merge the two databases on two discs. The solution is to employ a merge sort, so he took the book from Niklaus WirthData+Functions=Algorithms” and copied the merge sort implementation. Unfortunately the merge sort algorithm on the book was not stable, so Peter had to rewrite it from scratch.

Implementing the DB was quite a nightmare for many technical constraints – computer memory, mass storage size and computing power. The PC was an IBM with MS-DOS 1.x. This version had no directories. The Pascal language had no dynamic strings – they had to be statically sized.

As computer technology progressed he switched to a 386 compatible and turned to minix. During this period, Peter used PMate – an editor that sported macros, so that statement frames could be automatically generated.

Lessons learned:

  • do not trust programming books and commercial libraries;
  • before putting code in your book, always compile and test it;
  • programming should be supported by fast and powerful editor;
  • you should not need to type all of your code;
  • generic data driven solution make code simpler
  • automate tests.

1987 C and Unix courses

During this period Peter started teaching classes about C and Unix. In this classes shell, system programming and C were taught. Also he had to teach how to unlearn things to Cobol Programmers.

In 1988 he wrote some C guidelines – some are still actual today:

  • a function is too long if it does not fit in a 80×24 terminal; functions with longer code tells a failure in teaching software engineering.
  • a function should not take more than 3-5 parameters, if you need more, then you have to logically group them in struct. Windows API had function with 7 or more parameters, that why Peter swore never to program this OS.
  • Everything is an int, unless it is a pointer or a double. (The reason why we need to use the typename keyword today is a legacy of this choice made by K&R).
  • Program abstract data types using opaque structs and pointers.
  • Do not use scanf() ever in production code, use sscanf() instead and check its result.

Lessons learned:

  • K&R C, Unix, sh, awk, sed, vi, nroff, make, lex, yacc,…
  • You learn the most by teaching.
  • One can program in fortran/basic/cobol in any language.
  • abstraction, layering, cohesion.
  • the power of plain text, regexp, pipelines and scripting. Searching log files for what went wrong.

In 1989 Peter graduated. His master thesis was about a Modula-2 frontend for the Amsterdam Compiler Kit. In this year, also the first experiments with C++ using the Zortech C++ compiler. Writing code with this environment was quite intensive since any error in the code caused the whole machine to die. These were days before Protected Memory.

Lessons Learned:

  • Compilers and AST with polymorphic nodes (polymorphism via union);
  • Virtual Machine Languages (p-code, em-code). Java bytecode is very close to em-code;
  • more RAM and CPU power make compilation much faster;
  • ADT works even within a compile;
  • lex and yacc can be used in practice.

Conscript at Bundenswehr (the German Army). Peter worked in the Computer Center, on systems for Radar detection and analysis. This was quite a travel back in time. The computer was an 18 bits with core memory, switches to input bits into the main memory and tape for mass storage.

Also Peter found another computer history milestone – the VT100 terminal. This terminal defined what ANSI escape codes are today.

The language used for programming the system was Jovial a mix of pascal and algol that was later replaced by Ada.

Lessons learned

  • Computer and software history can be interesting;
  • Even hardware can be used beyond its expected lifetime;
  • source control is important (and copying floppy disc is not source control)

In 1990 he landed a job at Siemens and started working in OO programming environment. Here he worked with Erich Gamma and Andrè Weinard [NdM: I tried to look up this but haven’t found anything relevant, maybe I got the name wrong].

The system Peter was working at was a framework for C++ written in C++, named OOPU, intended to replace TOROS_OBER a C macro based OO system used for car diagnostic computer. Unfortunately OOPU was killed by internal politics. The reason was that a single programming environment had to be used across several business units, and other business units didn’t want to pay for it.

In this context Peter wrote an object inspector, named OBER, that showed the relations among objects (this was before DDD was invented). From architectural point of view this application sported an UI separated from the backend.

In 1991 Peter wrote the first C++ guidelines, although some are still applicable, many are obsolete because the language evolved and things can be done better.

1993 was the year of the PLoP (Pattern Languages of Programming), this was still before the Gang of Four Book. In this conferences many opportunities wer to learn and understand architecture by trying to explain patterns.

1995 is the year of GoF. The most important design pattern are:

Do not use the singleton since C++ has no fixed sequence of initialization. If you need singleton, think about design first. If you need globals, just use globals [NdM, I understand that singleton has some problems, but I don’t see what is the alternatives – globals lacks of encapsulation].

In PLoPs and EuroPLoP, patterns had a community, where knowledge was shared, and it offerd an environment to try out ides. Discussing how stuff has to be written and why.

With other participant of the PLoP/EuroPLoP, Peter wrote the book POSA – Pattern Oriented Software Architecture. This is less famous than Design Pattern by the Gang of Four, but tackled the same matter.

Layers vs. Tiers -Layers are different level of abstraction , you can swap a layer implementation with another; while in tiers, functionality spread on multiple tiers, all the the same level of abstraction employing different technologies.

Using layers the team can specialize, while using tiers the team needs to know everything.

Good abstraction is a skill. Too many layers are a problem if layers do nothing but pass values.

Lessons learned:

  • A prophet has no honor in his own country – get out to be heard;
  • good patterns require several rewrites;
  • Sharing teaches you to become a better author;
  • Get a good copy editor, when publishing;
  • Network to learn – attend and speak at conferences;
  • most important part of a conference are the breaks.

In 1997 he moved to Switzerland to take the job of Erich Gamma, more precisely to work on a small server framework www.coast-project.org. While in charge Peter extended and improved the library and developed C++ Unit Test and test automation in doing so. He delivered high-quality solutions using Agile Software development and Wiki wiki web.

In these years Peter came to some interesting conclusions about high quality software. First high quality leads to better developer and better retainment of those developer. Unfortunately high quality leads to canceled maintenance contracts. If nothing goes wrong, the customer doesn’t see the point in paying for something that is never used.

Also, counter intuitively, high quality leads to less business opportunities. Since the contacts between customer and developer are minimal and usually don’t escalate to top level management. So the developer has less chance to meet and know the decision makers of the customer company.

Also CUTE – the test environment – has been created to replace macro-ish test.

Lessons learned

  • to advance you need to take risks;
  • do not be shy to use your network;
  • if you can not change your organization change your organization;
  • leading a team and evolving it is gratifying
  • producing software with a superior quality might be bad for business
  • working software lives longer than intended.

1998 and Agile Development. It became evident that following rituals is not adhering to values. You can do all the scrum rituals, but without committing to its values. When doing agile the most important things are:

  • Unit Tests and TDD;
  • Refactoring and Simple Design;
  • Rhythm and feedback, i.e. thinking about what you do and what you achieve;
  • Open and honest communication.

Lessons Learned

  • be self sufficient, but accept and appreciate help;
  • as a patient you are responsible for yourself.

Since 2004, Peter started teaching agile and continued teaching C++, concurrent and network programming. Also started working on IDE and tools, with the goal to eliminate bad software.

In 2005 many people thought that refactoring tool for C++ was impossible to write. Peter worked to prove them wrong, and Cevelop is the result.

[NdM: The talk ran out of time about now.]


My comments

Despite being sympathetic with Peter, since we shared so much in terms of computer history, programming and software engineering, I have to admit that his talk didn’t leave me with much. Also I fail to see the lack of some self-celebration. Maybe the most interesting thought was that quality software leads to less business. The rest of the talk was enjoyable, but I couldn’t see the connection with C++ or the future (or the history) of this language.

Optimization Time – Scala (and C++)

The Case

It is quite a feeling when you learn that your commit, a couple of months ago, broke the build is such a subtle way that it took so long to be detected. Possibly a more thorough testing and validation of the software would have caught it earlier, nonetheless it’s there and you and your coworkers are working hard to delimit the offending code and better understanding what caused the mess.

It turned out that some perfectly working code was taking too much time in one of the hot spot of our codebase. More precisely that code operated a conversion from an incoming data packet into a format suitable for data processing… about 2500 times a second on a modest hardware.

The code resulting in such poor performances that disrupted the device functionality seemed like a good idea in a very functional idiom –

@inline final val SAMPLES_PER_BLOCK = 6
@inline final val BLOCK_LENGTH = 8
@inline final val BLOCK_HEADER = 2

private def decodeFunctional(packet: Array[Byte]): Array[Int] = {
    val lowSamples: Array[Int] = packet.drop(BLOCK_HEADER).map(x => x.toUint)

    val msbs: Int = ((packet(0).toUint << 8) | packet(1).toUint) & 0xFFF
    Array.tabulate(SAMPLES_PER_BLOCK)(
        x => lowSamples(x) | (((msbs << 2 * x) & 0xC00) >> 2)
    )
}

Basically the incoming packet holds six 10-bits samples in 8 bytes and some bits-shifting’n’pasting is needed to rebuild the six sample array for later processing.

Once found that the problem was here I decided rewrite the code to get rid of the functional idiom and take an imperative/iterative approach:

private def decodeImperative(packet: Array[Byte]): Array[Int] = {
    // the code below is ugly, but it needs to be fast, don't touch,
    // unless you know what you are doing
    val samples: Array[Int] = new Array[Int](SAMPLES_PER_BLOCK)
    val msbs: Int = ((packet(0).toUint << 8) | packet(1).toUint) & 0xFFF
    var index: Int = 0
    while (index < SAMPLES_PER_BLOCK) {
        samples(index) = packet(BLOCK_HEADER + index).toUint |
                         (((msbs << 2 * index) & 0xC00) >> 2)
        index += 1
    }
    samples
}

(the comment is a colorful note I decide to leave for the posterity :-)).

The idea is to get rid of the read-only constraint imposed by functional programming and save all the temporaries.

Optimizing is a subtle art, and even if intuition may guide you, only measurement can tell whether the optimization succeeded. So I set up some speed test (see “methodology” below) to objectively assess my results.

The speed improvement was astounding: more than 7 times faster (7.6 to be precise).

Energized by this victory I decided to switch to heavy optimization weapon. Not being sure if scala/java compiler optimization really does its optimization homework, I opted for manually unrolling the loop in the code:

private def decodeUnrolled0( packet: Array[Byte] ) : Array[Int] = {
    val PACKET0 : Int = packet(0).toUint
    val PACKET1 : Int = packet(1).toUint
    Array[Int](
        packet(BLOCK_HEADER+0).toUint | ((PACKET0 << 6) & 0x300),
        packet(BLOCK_HEADER+1).toUint | ((PACKET0 << 8) & 0x300),
        packet(BLOCK_HEADER+2).toUint | ((PACKET1 << 2) & 0x300),
        packet(BLOCK_HEADER+3).toUint | ((PACKET1 << 4) & 0x300),
        packet(BLOCK_HEADER+4).toUint | ((PACKET1 << 6) & 0x300),
        packet(BLOCK_HEADER+5).toUint | ((PACKET1 << 8) & 0x300)
    )
}

I was somewhat disappointed to learn that this version is only marginally faster (x1.6) than the original functional version. This didn’t really make sense to me, since the loop is completely unrolled and just trivial computation needed to be performed. So I decompiled the .class file:

public int[] CheckPerf$$decodeUnrolled0(byte[] packet)
{
  int PACKET0 = UnsignedByteOps(packet[0]).toUint();
  int PACKET1 = UnsignedByteOps(packet[1]).toUint();
  return (int[])Array..MODULE$.apply(Predef..MODULE$.wrapIntArray(new int[] {
    UnsignedByteOps(packet[2]).toUint() | PACKET0 << 6 & 0x300, 
    UnsignedByteOps(packet[3]).toUint() | PACKET0 << 8 & 0x300, 
    UnsignedByteOps(packet[4]).toUint() | PACKET1 << 2 & 0x300, 
    UnsignedByteOps(packet[5]).toUint() | PACKET1 << 4 & 0x300, 
    UnsignedByteOps(packet[6]).toUint() | PACKET1 << 6 & 0x300, 
    UnsignedByteOps(packet[7]).toUint() | PACKET1 << 8 & 0x300 }), ClassTag..MODULE$.Int());
}

Instead of the Java array I expected, some Scala internals are used to create the array and then converting it into the Java thing. Possibly this was the cause of the slowness. By looking around at the decompiled code I found another function that just used a Java int array. So I rewrote the unrolled version accordingly –

private def decodeUnrolled1( packet: Array[Byte] ) : Array[Int] = {
    val samples = new Array[Int](SAMPLES_PER_BLOCK)
    val PACKET0 = packet(0).toUint
    val PACKET1 = packet(1).toUint
    samples(0) = packet(BLOCK_HEADER + 0).toUint | ((PACKET0 << 6) & 0x300)
    samples(1) = packet(BLOCK_HEADER + 1).toUint | ((PACKET0 << 8) & 0x300)
    samples(2) = packet(BLOCK_HEADER + 2).toUint | ((PACKET1 << 2) & 0x300)
    samples(3) = packet(BLOCK_HEADER + 3).toUint | ((PACKET1 << 4) & 0x300)
    samples(4) = packet(BLOCK_HEADER + 4).toUint | ((PACKET1 << 6) & 0x300)
    samples(5) = packet(BLOCK_HEADER + 5).toUint | ((PACKET1 << 8) & 0x300)
    samples
}

The main difference is that here the array is first declared and allocated, then filled. In the above code the array was created, initialized and returned all in the same statement.

The speed improvement was good, but not much better than my imperative version: x7.77 times faster.

Then a colleague pointed out that I was using the “old” Scala 2.10 compiler and that I should try the latest 2.12 that benefits from a better interoperability with the underlying JVM 1.8.

In the following table you get the performance comparison:

 functionalimperativeunrolled0unrolled1
scala 2.10 times (lower is better)26.153.4016.603.36
scala 2.10 improvements over functional (higher is better)1x7.68x1.58x7.77
scala 2.12 times (lower is better)24.951.1713.30.021
scala 2.12 improvements over scala 2.10 functional
x1.05x22.32x1.97x1245

Functional and unrolled0 are quite unsurprising, just what you expect from the next version of the compiler. Imperative approach yields quite a boost – 22 times faster than the functional version on the previous compiler. The shocking surprise is in the last column – the unrolled1 version of the code runs in excess of 1200 times faster than the original code!

Before jumping to conclusions, I performed the same tests on the code translated into C++. Here is the equivalent code:

typedef std::array<int,SAMPLES_PER_BLOCK> OutputArray;
typedef std::array<uint8_t,BLOCK_LENGTH> InputArray;

OutputArray decodeImperative( InputArray const& packet )
{
    OutputArray samples;
    uint32_t msbs = ((packet[0] << 8) | packet[1]) & 0xFFF;
    for( int index=0; index< SAMPLES_PER_BLOCK; ++index )
    {
        samples[index] = packet[BLOCK_HEADER + index] |
                         (((msbs << 2 * index) & 0xC00) >> 2);
    }
    return samples;
}

OutputArray decodeFunctional( InputArray const& packet )
{
    OutputArray lowSamples;
    std::copy( packet.begin()+2, packet.end(), lowSamples.begin() );
    uint32_t msbs = ((packet[0] << 8) | packet[1] ) & 0xFFF;
    OutputArray samples;

    int index=0;
    OutputArray::const_iterator scan = lowSamples.begin();
    std::generate(
        samples.begin(),
        samples.end(),
        [&index,&scan,msbs]()
        {
            return *scan++ | (((msbs << 2*index++) & 0xC00) >> 2);
        }
    );
    return samples;
}

OutputArray decodeUnrolled0( InputArray const& packet )
{
    uint8_t PACKET0 = packet[0];
    uint8_t PACKET1 = packet[1];

    return OutputArray{
        packet[BLOCK_HEADER+0] | ((PACKET0 << 6) & 0x300),
        packet[BLOCK_HEADER+1] | ((PACKET0 << 8) & 0x300),
        packet[BLOCK_HEADER+2] | ((PACKET1 << 2) & 0x300),
        packet[BLOCK_HEADER+3] | ((PACKET1 << 4) & 0x300),
        packet[BLOCK_HEADER+4] | ((PACKET1 << 6) & 0x300),
        packet[BLOCK_HEADER+5] | ((PACKET1 << 8) & 0x300)
    };
}

OutputArray decodeUnrolled1( InputArray const& packet )
{
    OutputArray samples;
    uint8_t PACKET0 = packet[0];
    uint8_t PACKET1 = packet[1];

    samples[0] = packet[BLOCK_HEADER+0] | ((PACKET0 << 6) & 0x300);
    samples[1] = packet[BLOCK_HEADER+1] | ((PACKET0 << 8) & 0x300);
    samples[2] = packet[BLOCK_HEADER+2] | ((PACKET1 << 2) & 0x300);
    samples[3] = packet[BLOCK_HEADER+3] | ((PACKET1 << 4) & 0x300);
    samples[4] = packet[BLOCK_HEADER+4] | ((PACKET1 << 6) & 0x300);
    samples[5] = packet[BLOCK_HEADER+5] | ((PACKET1 << 8) & 0x300);
    return samples;
}

I’m positive that functional purist will scream in horror at the sight of C++ functional version of the code, but it is the closest thing I could do using the standard library. Should you have any idea on how to improve the functional idiom using the standard library, let me know and I’ll update my test set.

Here are the C++ results –

 FunctionalImperativeUnrolled 0Unrolled 1
C++ times (lower is better)0.320.0740.0530.049
Improvements over functional (higher is better)x1x4.36x6.1x6.6
Improvements over Scala 2.10 (higher is better)
x81x46x313x69
Improvements over Scala 2.12 (higher is better)x77x16x250x0.43

This is mostly unsurprising but for the the lower right column which seems to indicate that Scala version is faster than C++. I guess that the reason is because I am using timing values that are rebased considering a standard price you pay for executing the code under test and that I don’t want to consider in my measurements (see below about the methodology I used). My interpretation is that the overhead in Scala hides the cost for a trivial expression so that the base time is much more close to the execution time than it is in C++.

Conclusions

Drawing general conclusions from a specific case is all but easy and trivial. Anyway according to data and to intuition, if you want to code using the functional idiom you pay a price that sometimes can be hefty. Paying an execution price, be it memory or cpu cycles, to raise the abstraction level is something that has been with computer programming from the dawn of this craft. I remember echoes from the ranting against high level languages in favor of assembly to avoid paying a 30% penalty. Now those times are gone, but facing a x4-x7 overhead can be something not always you can afford.

(Raising the level of abstraction by employing the functional idiom is something that is not undisputed, but I will spare this for another post).

Writing efficient code is not always intuitive. Constructs that seem innocent enough turns out to be compiled into time consuming lower level code. Peeking at compiled code is always useful. While languages as C++ are accompanied by detailed information about performances of various constructs and library components, Scala (as Java) lacks of these information. Also the huge amount of syntactic sugar employed by Scala does not simplify the life of the programmer that has to optimize code.

Eventually I cannot save to remarks the performances of interpreted languages (and even more so functional interpreted languages), despite of the progress in the JVM technologies, are still way behind native language ones. Even employing a functional approach C++ is some 80 times faster than Scala. You can argue that the C++ code above is not really functional, but there is nothing preventing me to use a mostly functional library for C++, with roughly the same performances of the functional version of the C++ code. It could be more tedious to write because of the lesser grade of syntactic glucose in C++, yet it is still functional.

I’m curious to see how scala-native compares, but this again is a topic for another post.

The last thought I want to report here is that you can make Scala code much faster, by avoiding the functional paradigm and peeking at compiled code. It is something you are not usually expected to do, but and improvement of 1200 times is a worth gain for the most executed code.

A very last thought for those who think that after all nowadays we have such powerful computers that analyzing performances and optimizing is a waste of time. I would like to remember that data processing requires energy and today energy production mostly relies on processes that produce CO2, something that the world doesn’t need. A data processing that runs just twice the speed of another will produce half of the carbon dioxide emissions. The environment benefits for switching to C and C++ should be clear to everyone.

Methodology

A couple of words on methodology. The compilers were Scala 2.10.6, Scala 2.12.2 and GNU c++ 7.3.1. All the test were performed on Linux (Fedora Core 27) using the command /bin/time –format=”%U %S”  command to print the User and Kernel times of the command and summing the two values together. These are times that CPU spent executing the process, not wall clock. The decoding function is executed 0x7FFFFFFF (some 2 billions) times per run and each program is tested for 12 runs. The fastest and the slowest run are discarded and the other are used to compute an average that is considered the measure of the test.

The base time is computed by employing a dummy decoder that does nothing but filling the result array with constants.

This setup should take in account the JIT effect and measure just the code I listed above.