++it18 – Zero Allocation and No Type Erasure Futures

This talk by Vittorio Romeo tackles a subject you could easily ignore if your programming style doesn’t call for parallelism (or if it does, plain pthreads satisfy the need). Futures were introduced in C++ with the 2011 ISO standard, so they can hardly be deemed a novelty.

I was hit by futures some 4 years ago, when confronted with a large Scala code base that exploited futures and actors. So, at least to save my ego, there is nothing wrong if you don’t know what a future is :-).

In brief, a future is a way to model a concurrent operation. The future can be still executing, succeeded with a result value, or failed with an error. As a user, you request an async operation and you get a future of the expected result. You can set callbacks on completion, or combine several futures together either by chaining them or by synchronizing on them. As you may imagine, futures vastly simplify the handling of async computation by hiding the synchronization and communication details and by providing a common and clear usage pattern.
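As a quick refresher on the standard facility, here is a minimal self-contained sketch of mine (the "hello" payload and the variable names are placeholders, not from the talk) showing std::async and std::future in action:

#include <future>
#include <iostream>
#include <string>

int main()
{
    // request an async operation and get a future of the expected result
    std::future<std::string> f = std::async( std::launch::async, []
    {
        return std::string{ "hello" };
    });

    // ...do other useful work here...

    std::cout << f.get() << "\n"; // blocks until the result is available
}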

So, no excuses: if you need to do async programming, you’d better know this tool.

Type erasure is an interesting concept, possibly rooted somewhere in type theory, but I suspect it made the news mostly for practical needs. To picture it in my mind, I re-elaborate the theory as follows. Generic programming in C++ is achieved by templates. (Simplifying) a template accepts a type as an argument and then generates specific code by replacing that type in the template body everywhere it is referenced. By this definition a template is not so different from a macro, and it sports the interesting feature that you can apply it to any type, as long as that type supports the methods and operators used in the template. In other words, given the code:

template<typename T>
void f( T const& x )
{
    std::cout << "Hello " << x.toString() << "\n";
}

this code compiles and works for all the different Ts that have a toString() method returning something for which operator<< into std::ostream is defined. These Ts do not need to inherit from a common base class with a toString() method, nor to satisfy any other constraint.

Java (and Java-like languages) uses a different approach. The template code is not rewritten for each type T; instead, arguments are downcast/upcast as needed so that the same code can be applied to user types. With this approach, the template type argument needs to inherit from a common base class that provides all the operations used in the template body. This approach is called type erasure. (Now, this conflicts with the explanation I just gave, since conceptually typing is ignored in the C++ approach and retained, to an extent, in the Java approach… this possibly means that I missed something :-).)
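To make the contrast concrete, here is a hedged sketch of mine (names like printable and greeter are invented for illustration) of the Java-style alternative expressed in C++: a common base class erases the concrete type behind a uniform interface.

#include <iostream>
#include <string>

// the "erased" interface: every usable type must derive from it
struct printable
{
    virtual ~printable() = default;
    virtual std::string toString() const = 0;
};

// a single compiled function works through the base class;
// the concrete type of x is no longer visible here
void f( printable const& x )
{
    std::cout << "Hello " << x.toString() << "\n";
}

struct greeter : printable
{
    std::string toString() const override { return "world"; }
};

int main()
{
    greeter g;
    f( g ); // prints "Hello world"
}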

So, time to let Vittorio explain what he did. His slides are here, while you can find the entire video of the speech here (#include <stddisclaim>, i.e. errors and inconsistencies are mine, not Vittorio’s).


My goal is to build chains of asynchronous computations without incurring the overhead of type erasure and/or dynamic allocation. To reach this goal, the compiler must encode the structure of the computation chain in the template instantiation. E.g.:

auto graph = leaf{ []{ return "hello"; } }.then( []( std::string x ){ return x + " world"; } );

This is a composition of two futures that produces “hello world”. The return type – hidden by the auto keyword – is a template that encodes a graph describing the execution. Pretty-printed, the template type looks like:

node::seq<
    node::leaf<
        void,
        (lambda#0)
    >,
    node::leaf<
        std::string,
        (lambda#1)
    >
>

Another use case for futures is the composition of two time-consuming operations:

auto graph = all{
    []{ return http_get_request( "animals.com/cat/0.png" ); },
    []{ return http_get_request( "animals.com/dog/0.png" ); },
}.then( []( std::tuple<data, data> t ){ /*...*/ } );

Two concurrent HTTP GET operations are issued and the resulting data is combined in the .then continuation.

Again, the structure of the composition can be inferred from the template type:

node::seq<
    node::all<
        node::leaf<void,(lambda#0)>,
        node::leaf<void,(lambda#1)>
    >,
    node::leaf<std::tuple<data,data>,(lambda#2)>
>

Examining the current C++ standard, you find the std::async function, which takes a function to run asynchronously and returns a std::future yielding the result of that function. If no explicit wait is invoked, the destructor of the future waits for the asynchronous function to finish. E.g.:

auto f = std::async( std::launch::async, []
{
    std::this_thread::sleep_for( 100ms );
    std::cout << "world!";
});
std::cout << "Hello, ";

When you have simple functions to run concurrently, with no dependencies, this method is really convenient. E.g.:

auto f0 = std::async(...);
auto f1 = std::async(...);
auto f2 = std::async(...);

Eventually all three fs will complete and the execution will terminate. When dependencies exist among concurrent operations, however, the std::async/std::future pair no longer offers a viable solution.

Picture a case where you have several asynchronous operations to combine in order to produce the desired result – some of them run in parallel, others in sequence – and you may have a complex graph representing the data flow:

            +----+
        .-->| f1 |---.
       /    +----+    \
+----+/     +----+     \+----+
| f0 |----->| f2 |----->| f4 |
+----+\     +----+  ___/+----+
       \    +----+ /    +----+
        `-->| f3 |----->| f5 |
            +----+      +----+

In fact, to compose sequences (i.e. do f0, then f1) you need to nest std::async calls:

auto f0 = std::async( std::launch::async, []
{
    std::cout << "hello ";
    auto f1 = std::async( std::launch::async, []
    {
        std::cout << "world\n";
    });
});

Another typical use case is collecting the results of multiple futures for a subsequent computation:

auto a0 = std::async( std::launch::async, []{ std::cout << "a0\n"; } );
auto a1 = std::async( std::launch::async, []{ std::cout << "a1\n"; } );
auto a2 = std::async( std::launch::async, []{ std::cout << "a2\n"; } );

a0.get();
a1.get();
a2.get();

auto b0 = std::async( std::launch::async, []{ std::cout << "b0\n"; } );

We want an intuitive abstraction to represent future composition. Currently there are some libraries that offer this feature: boost::future and std::experimental::future (from the “Extensions for Concurrency” TS). The ISO C++ TS has been contributed by Anthony Williams; you can check his talk on Concurrency, Parallelism and Coroutines at CppCon 2017.

In Boost, chaining is simply performed with the following structure:

boost::async( boost::launch::async,
    []{
        std::cout << "hello ";
    }
).then(
    [](auto){
        std::cout << "world\n";
    }
);

The auto parameter means that the result of the first future is passed along to the continuation.
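As a hedged sketch of how a value travels along the chain (assuming Boost.Thread is built with continuation support enabled, e.g. BOOST_THREAD_VERSION 4): the continuation receives the completed antecedent future and extracts its value with get().

boost::future<int> a = boost::async( boost::launch::async, []{ return 40; } );
boost::future<int> b = a.then( []( boost::future<int> f )
{
    return f.get() + 2; // f is already ready here, so get() does not block
});
std::cout << b.get() << "\n"; // prints 42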

Waiting for multiple futures is performed using the boost::when_all( ... ) function, as shown in the following code:

boost::when_all(
    boost::async( boost::launch::async, []{ std::cout << "a0\n"; } ),
    boost::async( boost::launch::async, []{ std::cout << "a1\n"; } ),
    boost::async( boost::launch::async, []{ std::cout << "a2\n"; } )
).then( [](auto){ std::cout << "b0\n"; } );

Composition is easy and natural. Primitives are when_all, when_any and then. Each of them returns a future that can be further composed.

So, what’s wrong with that?

Two things, mainly. First, the result is always a future<T>, which means that the type of the composition is erased. In other words, you can compose futures of the same type T:

boost::future<int> a = ....
boost::future<int> b = a.then( ... )

But you can’t mix futures of different types.

The other drawback is that the Boost implementation keeps some shared state in dynamically allocated memory.

Can we do better? And, more importantly, is it worth it?

Type erasure is necessary because the way these futures compose can change depending on run-time control flow. If the shape of the future computation graph could be fixed at compile time, the graph could be encoded using the type system.
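To see why run-time control flow forces erasure, consider this hedged sketch of mine (n_steps is an invented variable known only at run time): every .then collapses into the same boost::future<int> type, so a chain whose length is decided at run time is still representable – at the price of erasing its structure.

boost::future<int> f = boost::async( boost::launch::async, []{ return 0; } );

for( int i = 0; i < n_steps; ++i ) // n_steps known only at run time
{
    // each .then returns another boost::future<int>: the shape of the graph is erased
    f = f.then( []( boost::future<int> prev ){ return prev.get() + 1; } );
}

std::cout << f.get() << "\n"; // prints n_steps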

Let’s sketch out something that could work in this direction. The code you would like to write could be something like:

auto f0 = leaf{ [] { std::cout << "a"; }};
auto f1 = f0.then( []{ std::cout << "a"; });

In order to do that, you need to define the leaf template and the operation then that works on it:

template<typename F>
leaf<F>::leaf( F&& f );

template<typename F>
template<typename FThen>
sequential<leaf<F>, leaf<FThen>> leaf<F>::then( FThen f_then );
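As a minimal sketch of how then could be implemented (my own guess based on the declarations above, not code taken from the talk), the current leaf and the new continuation are both moved into a sequential node, so both callables end up encoded in the return type:

template<typename F>
template<typename FThen>
sequential<leaf<F>, leaf<FThen>> leaf<F>::then( FThen f_then )
{
    // no allocation: the whole (sub)graph lives inside the returned object
    return { std::move( *this ), leaf<FThen>{ std::move( f_then ) } };
}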

The other relevant problem is to execute a set of tasks in parallel and wait for all their results before moving on.

auto f0 = when_all( [] { std::cout << "a0"; },
                    [] { std::cout << "a1"; },
                    [] { std::cout << "a2"; });
auto f1 = f0.then( [] { std::cout << "b0"; } );

Using variadic template arguments, it is possible to encode this syntax as well:

template <typename ... Fs>
parallel<leaf<Fs> ...> when_all( Fs... fs );
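A minimal sketch of how when_all could build the parallel node (again my own guess, assuming the leaf and parallel templates above): each callable is wrapped in a leaf and the pack expansion constructs the parallel node.

template<typename... Fs>
parallel<leaf<Fs>...> when_all( Fs... fs )
{
    // wrap every callable in a leaf and expand the pack into the parallel node
    return parallel<leaf<Fs>...>{ leaf<Fs>{ std::move( fs ) }... };
}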

The template will encode the type of f1 as:

sequential<
    parallel<
        leaf<lambda#0>,
        leaf<lambda#1>,
        leaf<lambda#2>
    >,
    leaf<lambda#3>
>

Now that we managed to store the structure of the future, let’s take a look at what can be done about dynamic allocation. std::future<T> always uses dynamic allocation to store the shared state.

I.e. a promise/future pair has a connection that keeps track of promise fulfillment and future completion. This connection is reference counted and is shared between the connected promises and futures.

Since we know the shape of the final graph – thanks to the type system – why don’t we store the state in the graph object itself? With this approach it suffices to keep the graph object alive until execution is completed.

So if the

scheduler.execute(f);

call starts the asynchronous execution of the graph f, then we need to guarantee that f stays alive until the end of the computation.

State is spread among the graph nodes, which are allocated on the stack when needed. This allows us to avoid dynamic allocation and shared pointers altogether.

Now that the overall design is laid out, we can start the implementation. I will present:

  • leaf node
  • seq node (for sequential chaining)
  • all node (for parallel execution of predecessors)
  • return value propagation
  • blocking execution
  • continuation syntax (.then)
  • any node (for when_any composition)

All nodes need to implement the concept below. A concept is an interface that a type needs to satisfy in order to be accepted as a parameter of a given template.

All node types need to implement:

template<typename Scheduler, typename Then>
void /* node */::execute( Scheduler& scheduler, Then&& then ) &
{
    // executes the stored computation, then continues execution via then
}

Later this will be improved by enabling the propagation of return values. Note the & ref-qualifier on the method: execute can only be called on lvalues, and this is the key to keeping the graph alive until it is completely executed.
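A tiny illustration of what the & ref-qualifier buys us (my own example, using the leaf node and the scheduler introduced below): the graph must be a named object, so it naturally survives the call.

leaf l{ []{ std::cout << "ok\n"; } };
l.execute( scheduler, []{} );               // fine: l is an lvalue kept alive by the caller
// leaf{ []{} }.execute( scheduler, []{} ); // does not compile: execute is &-qualified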

The scheduler argument is the object that provides the execution time and context. It can be as simple as

std::thread{f}.detach()
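For the examples that follow, a scheduler along these lines would do – a minimal sketch of mine, assuming the only requirement is an execute member that runs a callable asynchronously:

#include <thread>
#include <utility>

struct simple_scheduler
{
    template<typename F>
    void execute( F&& f )
    {
        // fire and forget: run the callable on a fresh, detached thread
        std::thread{ std::forward<F>( f ) }.detach();
    }
};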

Now let’s sketch out the idea for a leaf. A leaf’s purpose is simply to wrap a single computation. It exposes execute, but doesn’t make use of the scheduler. A leaf is used as in the following example:

leaf l{ []{ std::cout << "Hello "; } };
l.execute( scheduler, []{ std::cout << "world\n"; } );

Note that this is possible thanks to C++17 class template argument deduction (basically you can omit the template type arguments if they can be deduced from the constructor).

The leaf template is defined as follows:

template<typename F>
struct leaf : F
{
    leaf( F&& f ) : F{ std::move( f )}
    {}

    template <typename Scheduler, typename Then>
    void execute( Scheduler&, Then&& then ) &
    {
        (*this)();                  // run the wrapped computation
        std::forward<Then>(then)(); // then invoke the continuation
    }
};

Thanks to class template argument deduction, when you write:

leaf l{some_lambda};

The compiler turns it implicitly into:

leaf<decltype(some_lambda)> l(some_lambda);

The seq node takes two nodes and executes the first and then the second:

a = lambda#0
b = lambda#1

seq s0{ std::move(a), std::move(b) };
s0.execute(...);

Here is the source code:

template<typename A, typename B>
struct seq : A, B
{
    seq( A&& a, B&& b ) : A{ std::move(a) }, B{ std::move(b) }
    {
    }

    template<typename Scheduler, typename Then>
    void execute( Scheduler& scheduler, Then&& then ) &
    {
        A::execute( scheduler, [this, &scheduler, then]
        {
            B::execute( scheduler, then );
        });
    }
};

As you can see, A is executed at the start of the execute method; when it completes, B is executed, since it is invoked inside the then continuation handed to A::execute.

Note that this form allows non-blocking asynchronous composition: then is parked in the stack until the execution of A is complete.

all takes any number of inputs, executes them in parallel and continues only when all of them have completed. It is used this way:

all graph{
    leaf{ []{ std::cout << "a0\n"; } },
    leaf{ []{ std::cout << "a1\n"; } },
    leaf{ []{ std::cout << "a2\n"; } },
};

graph.execute( scheduler, []{ std::cout << "b0\n"; } );
std::this_thread::sleep_for(100ms);

For this to work, the all node must keep track of which activities are still executing and which have completed. An atomic counter counts the number of activities still running. The implementation is like this:

template<typename... Fs>
struct all : Fs...
{
    std::atomic<int> _left;

    all( Fs&&... fs ) : Fs{ std::move( fs ) }...
    {}

    template<typename Scheduler, typename Then>
    void execute( Scheduler& scheduler, Then&& then ) &
    {
        _left.store( sizeof...(Fs) );

        // comma-operator fold over the Fs bases: schedule each sub-computation
        ( scheduler.execute( [this, &scheduler, &f = static_cast<Fs&>(*this), then]
        {
            f.execute( scheduler, [this, then]
            {
                // the last task to finish invokes the continuation
                if( _left.fetch_sub(1) == 1 )
                {
                    then();
                }
            });
        }), ... );
    }
};

First, note that inheritance is used here not only for the Empty Base Optimization, but also because it makes it convenient to work with the fold expression.

Note that execute revolves around the comma operator used in the fold expression. Each leaf is executed using the scheduler and, in its then part, the counter is decremented.

Since the counter is initialized to the number of sub-tasks, when it reaches one it means that this is the last task completing, so the then part of the all node can be executed.
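A tiny worked example of the counter logic (mine, not from the talk): fetch_sub returns the value held before the decrement, so the task that observes 1 knows it is the last one to finish.

std::atomic<int> left{ 3 };        // three sub-tasks

assert( left.fetch_sub(1) == 3 );  // first task finishes: not the last
assert( left.fetch_sub(1) == 2 );  // second task finishes: not the last
assert( left.fetch_sub(1) == 1 );  // third task finishes: invoke then() here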

So now that we have our definition of futures, let’s compare it with other implementations. [NdM: the presentation slides include a number of graphs I couldn’t copy directly into this post; I invite you to explore Vittorio’s original slides. I will continue with a brief textual description of the comparison results.]

Some benchmarking against boost::future shows that when computations are short, avoiding type erasure and dynamic allocation matters a lot (up to 2 orders of magnitude). Otherwise, it doesn’t matter much.

Also the more chaining is done, the better this solution performs.

1ms seems to be the maximum computation time for which this solution still shows an advantage.
You can find the source code on GitHub.
