March 2018

Optimization Time – Scala (and C++)

The Case

It is quite a feeling when you learn that a commit of yours, from a couple of months ago, broke the build in such a subtle way that it took this long to be detected. Possibly more thorough testing and validation of the software would have caught it earlier; nonetheless it's there, and you and your coworkers are working hard to isolate the offending code and better understand what caused the mess.

It turned out that some perfectly working code was taking too much time in one of the hot spots of our codebase. More precisely, that code converted an incoming data packet into a format suitable for data processing… about 2500 times a second on modest hardware.

The code whose poor performance disrupted the device functionality had seemed like a good idea, written in a very functional idiom –

@inline final val SAMPLES_PER_BLOCK = 6
@inline final val BLOCK_LENGTH = 8
@inline final val BLOCK_HEADER = 2

private def decodeFunctional(packet: Array[Byte]): Array[Int] = {
    val lowSamples: Array[Int] = packet.drop(BLOCK_HEADER).map(x => x.toUint)

    val msbs: Int = ((packet(0).toUint << 8) | packet(1).toUint) & 0xFFF
    Array.tabulate(SAMPLES_PER_BLOCK)(
        x => lowSamples(x) | (((msbs << 2 * x) & 0xC00) >> 2)
    )
}

Basically the incoming packet holds six 10-bit samples in 8 bytes: the first two bytes carry the two most significant bits of each sample (12 bits in all), while the remaining six bytes hold the low 8 bits of each sample, so some bit-shifting'n'pasting is needed to rebuild the six-sample array for later processing.
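
Note that toUint is not a standard method on Byte: in our codebase it comes from an implicit extension class (the UnsignedByteOps wrapper you can spot in the decompiled output further below). Its exact definition is not shown here, but a minimal sketch of such an extension would be –

implicit class UnsignedByteOps(val b: Byte) extends AnyVal {
    // Byte is signed in Scala/Java, so masking with 0xFF reinterprets
    // the byte as an unsigned value held in an Int
    def toUint: Int = b & 0xFF
}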

Once I found that the problem was here, I decided to rewrite the code to get rid of the functional idiom and take an imperative/iterative approach:

private def decodeImperative(packet: Array[Byte]): Array[Int] = {
    // the code below is ugly, but it needs to be fast, don't touch,
    // unless you know what you are doing
    val samples: Array[Int] = new Array[Int](SAMPLES_PER_BLOCK)
    val msbs: Int = ((packet(0).toUint << 8) | packet(1).toUint) & 0xFFF
    var index: Int = 0
    while (index < SAMPLES_PER_BLOCK) {
        samples(index) = packet(BLOCK_HEADER + index).toUint |
                         (((msbs << 2 * index) & 0xC00) >> 2)
        index += 1
    }
    samples
}

(the comment is a colorful note I decided to leave for posterity :-)).

The idea is to get rid of the read-only constraint imposed by functional programming and to avoid all the temporaries.

Optimizing is a subtle art, and even if intuition may guide you, only measurement can tell whether an optimization succeeded. So I set up some speed tests (see "methodology" below) to objectively assess my results.

The speed improvement was astounding: more than 7 times faster (7.6 to be precise).

Energized by this victory, I decided to bring out the heavy optimization weapons. Not being sure whether the Scala/Java compiler really does its optimization homework, I opted for manually unrolling the loop:

private def decodeUnrolled0( packet: Array[Byte] ) : Array[Int] = {
    val PACKET0 : Int = packet(0).toUint
    val PACKET1 : Int = packet(1).toUint
    Array[Int](
        packet(BLOCK_HEADER+0).toUint | ((PACKET0 << 6) & 0x300),
        packet(BLOCK_HEADER+1).toUint | ((PACKET0 << 8) & 0x300),
        packet(BLOCK_HEADER+2).toUint | ((PACKET1 << 2) & 0x300),
        packet(BLOCK_HEADER+3).toUint | ((PACKET1 << 4) & 0x300),
        packet(BLOCK_HEADER+4).toUint | ((PACKET1 << 6) & 0x300),
        packet(BLOCK_HEADER+5).toUint | ((PACKET1 << 8) & 0x300)
    )
}

I was somewhat disappointed to learn that this version is only marginally faster (x1.6) than the original functional version. This didn't really make sense to me, since the loop is completely unrolled and only trivial computation needs to be performed. So I decompiled the .class file:

public int[] CheckPerf$$decodeUnrolled0(byte[] packet)
{
  int PACKET0 = UnsignedByteOps(packet[0]).toUint();
  int PACKET1 = UnsignedByteOps(packet[1]).toUint();
  return (int[])Array..MODULE$.apply(Predef..MODULE$.wrapIntArray(new int[] {
    UnsignedByteOps(packet[2]).toUint() | PACKET0 << 6 & 0x300, 
    UnsignedByteOps(packet[3]).toUint() | PACKET0 << 8 & 0x300, 
    UnsignedByteOps(packet[4]).toUint() | PACKET1 << 2 & 0x300, 
    UnsignedByteOps(packet[5]).toUint() | PACKET1 << 4 & 0x300, 
    UnsignedByteOps(packet[6]).toUint() | PACKET1 << 6 & 0x300, 
    UnsignedByteOps(packet[7]).toUint() | PACKET1 << 8 & 0x300 }), ClassTag..MODULE$.Int());
}

Instead of the plain Java array I expected, some Scala internals are used to create the array and then convert it into the Java thing. Possibly this was the cause of the slowness. By looking around in the decompiled code I found another function that just used a Java int array, so I rewrote the unrolled version accordingly –

private def decodeUnrolled1( packet: Array[Byte] ) : Array[Int] = {
    val samples = new Array[Int](SAMPLES_PER_BLOCK)
    val PACKET0 = packet(0).toUint
    val PACKET1 = packet(1).toUint
    samples(0) = packet(BLOCK_HEADER + 0).toUint | ((PACKET0 << 6) & 0x300)
    samples(1) = packet(BLOCK_HEADER + 1).toUint | ((PACKET0 << 8) & 0x300)
    samples(2) = packet(BLOCK_HEADER + 2).toUint | ((PACKET1 << 2) & 0x300)
    samples(3) = packet(BLOCK_HEADER + 3).toUint | ((PACKET1 << 4) & 0x300)
    samples(4) = packet(BLOCK_HEADER + 4).toUint | ((PACKET1 << 6) & 0x300)
    samples(5) = packet(BLOCK_HEADER + 5).toUint | ((PACKET1 << 8) & 0x300)
    samples
}

The main difference is that here the array is first declared and allocated, then filled, while in the previous version the array was created, initialized and returned all in a single statement.

The speed improvement was good, but not much better than my imperative version: 7.77 times faster than the original.

Then a colleague pointed out that I was using the "old" Scala 2.10 compiler and that I should try the latest 2.12, which benefits from better interoperability with the underlying JVM 1.8.

The following table shows the performance comparison:

[Table 1 – performance comparison of the functional, imperative, unrolled0 and unrolled1 decoders compiled with Scala 2.10 and Scala 2.12 (table data not recoverable)]

Functional and unrolled0 are quite unsurprising, just what you would expect from the next version of the compiler. The imperative approach yields quite a boost – 22 times faster than the functional version on the previous compiler. The shocking surprise is in the last column – the unrolled1 version of the code runs in excess of 1200 times faster than the original code!

Before jumping to conclusions, I performed the same tests on the code translated into C++. Here is the equivalent code:

#include <algorithm>
#include <array>
#include <cstdint>

constexpr int SAMPLES_PER_BLOCK = 6;
constexpr int BLOCK_LENGTH = 8;
constexpr int BLOCK_HEADER = 2;

typedef std::array<int,SAMPLES_PER_BLOCK> OutputArray;
typedef std::array<uint8_t,BLOCK_LENGTH> InputArray;

OutputArray decodeImperative( InputArray const& packet )
{
    OutputArray samples;
    uint32_t msbs = ((packet[0] << 8) | packet[1]) & 0xFFF;
    for( int index=0; index< SAMPLES_PER_BLOCK; ++index )
    {
        samples[index] = packet[BLOCK_HEADER + index] |
                         (((msbs << 2 * index) & 0xC00) >> 2);
    }
    return samples;
}

OutputArray decodeFunctional( InputArray const& packet )
{
    OutputArray lowSamples;
    std::copy( packet.begin()+2, packet.end(), lowSamples.begin() );
    uint32_t msbs = ((packet[0] << 8) | packet[1] ) & 0xFFF;
    OutputArray samples;

    int index=0;
    OutputArray::const_iterator scan = lowSamples.begin();
    std::generate(
        samples.begin(),
        samples.end(),
        [&index,&scan,msbs]()
        {
            return *scan++ | (((msbs << 2*index++) & 0xC00) >> 2);
        }
    );
    return samples;
}

OutputArray decodeUnrolled0( InputArray const& packet )
{
    uint8_t PACKET0 = packet[0];
    uint8_t PACKET1 = packet[1];

    return OutputArray{
        packet[BLOCK_HEADER+0] | ((PACKET0 << 6) & 0x300),
        packet[BLOCK_HEADER+1] | ((PACKET0 << 8) & 0x300),
        packet[BLOCK_HEADER+2] | ((PACKET1 << 2) & 0x300),
        packet[BLOCK_HEADER+3] | ((PACKET1 << 4) & 0x300),
        packet[BLOCK_HEADER+4] | ((PACKET1 << 6) & 0x300),
        packet[BLOCK_HEADER+5] | ((PACKET1 << 8) & 0x300)
    };
}

OutputArray decodeUnrolled1( InputArray const& packet )
{
    OutputArray samples;
    uint8_t PACKET0 = packet[0];
    uint8_t PACKET1 = packet[1];

    samples[0] = packet[BLOCK_HEADER+0] | ((PACKET0 << 6) & 0x300);
    samples[1] = packet[BLOCK_HEADER+1] | ((PACKET0 << 8) & 0x300);
    samples[2] = packet[BLOCK_HEADER+2] | ((PACKET1 << 2) & 0x300);
    samples[3] = packet[BLOCK_HEADER+3] | ((PACKET1 << 4) & 0x300);
    samples[4] = packet[BLOCK_HEADER+4] | ((PACKET1 << 6) & 0x300);
    samples[5] = packet[BLOCK_HEADER+5] | ((PACKET1 << 8) & 0x300);
    return samples;
}

I'm positive that functional purists will scream in horror at the sight of the C++ functional version of the code, but it is the closest thing I could do using the standard library. Should you have any idea on how to improve the functional idiom using the standard library, let me know and I'll update my test set.

Here are the C++ results –

[Table 2 – performance comparison of the C++ functional, imperative, unrolled0 and unrolled1 decoders (table data not recoverable)]

This is mostly unsurprising except for the lower right cell, which seems to indicate that the Scala version is faster than C++. I guess the reason is that I use timing values rebased by subtracting a fixed price you pay for executing the code under test, a price I don't want to include in my measurements (see below about the methodology I used). My interpretation is that in Scala the overhead hides the cost of a trivial expression, so the base time is much closer to the execution time than it is in C++.
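
To make the rebasing explicit, here is a minimal sketch of the computation I have in mind (the names are illustrative): the base time is subtracted from both measurements before taking the ratio, so when a variant's own time approaches the base time, the reported speed-up blows up –

// rebased speed-up: tBase is the time of a dummy decoder that pays only
// the fixed price of the harness; as tVariant approaches tBase the
// denominator shrinks and the reported speed-up grows very large
def speedUp(tReference: Double, tVariant: Double, tBase: Double): Double =
    (tReference - tBase) / (tVariant - tBase)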

Conclusions

Drawing general conclusions from a specific case is anything but easy and trivial. Anyway, according to the data and to intuition, if you want to code using the functional idiom you pay a price that can sometimes be hefty. Paying an execution price, be it memory or CPU cycles, to raise the abstraction level is something that has been with computer programming since the dawn of this craft. I remember echoes of the rants against high level languages, in favor of assembly, to avoid paying a 30% penalty. Those times are gone, but facing a x4-x7 overhead is not something you can always afford.

(Raising the level of abstraction by employing the functional idiom is itself not undisputed, but I will spare this for another post).

Writing efficient code is not always intuitive. Constructs that seem innocent enough turn out to be compiled into time-consuming lower level code. Peeking at the compiled code is always useful. While languages such as C++ are accompanied by detailed information about the performance of the various constructs and library components, Scala (like Java) lacks this information. Also, the huge amount of syntactic sugar employed by Scala does not simplify the life of the programmer who has to optimize code.

Finally, I cannot help remarking that the performance of interpreted languages (and even more so of functional interpreted languages), despite the progress in JVM technology, is still way behind that of native languages. Even employing a functional approach, C++ is some 80 times faster than Scala. You can argue that the C++ code above is not really functional, but there is nothing preventing me from using a mostly functional library for C++, with roughly the same performance as the functional version of the C++ code. It could be more tedious to write because of the lesser grade of syntactic glucose in C++, yet it would still be functional.

I’m curious to see how scala-native compares, but this again is a topic for another post.

The last thought I want to report here is that you can make Scala code much faster by avoiding the functional paradigm and peeking at the compiled code. It is something you are not usually expected to do, but an improvement of 1200 times is a gain worth having for the most executed code.

A very last thought for those who think that nowadays we have such powerful computers that analyzing performance and optimizing is a waste of time. I would like to remind them that data processing requires energy, and today energy production mostly relies on processes that produce CO2, something that the world doesn't need. A data processing pipeline that runs at twice the speed of another will produce half the carbon dioxide emissions. The environmental benefits of switching to C and C++ should be clear to everyone.

Methodology

A couple of words on methodology. The compilers were Scala 2.10.6, Scala 2.12.2 and GNU C++ 7.3.1. All the tests were performed on Linux (Fedora Core 27) using /bin/time --format="%U %S" to print the user and kernel times of the command, and summing the two values together. These are the times the CPU spent executing the process, not wall clock times. The decoding function is executed 0x7FFFFFFF (some 2 billion) times per run, and each program is tested for 12 runs. The fastest and the slowest runs are discarded and the others are used to compute an average, which is taken as the measure of the test.
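
For reference, here is a minimal sketch of what the measurement loop could look like (illustrative names, not the original harness; the timing itself is done externally by /bin/time, so the program just loops) –

// runs one timing pass: invoke the decoder 0x7FFFFFFF times on a fixed
// packet; the result is folded into a value so that the JIT cannot
// optimize the whole loop away
def run(decode: Array[Byte] => Array[Int], packet: Array[Byte]): Int = {
    var sink = 0
    var i = 0
    while (i < 0x7FFFFFFF) {
        sink ^= decode(packet)(0)
        i += 1
    }
    sink
}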

The base time is computed by employing a dummy decoder that does nothing but fill the result array with constants.
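
A possible shape for such a dummy decoder (again a sketch, not the original code) –

// pays the allocation and call overhead of a real decoder but performs
// no actual decoding: its run time is the base time of the harness
private def decodeDummy(packet: Array[Byte]): Array[Int] = {
    val samples = new Array[Int](SAMPLES_PER_BLOCK)
    var index = 0
    while (index < SAMPLES_PER_BLOCK) {
        samples(index) = 1 // arbitrary constant
        index += 1
    }
    samples
}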

This setup should take into account the JIT effect and measure just the code listed above.