
Choosing the right threading framework


This is the second article in a series about High Performance Computing with the Intel Xeon Phi, Intel's first commercial product to incorporate the Many Integrated Core architecture. In this article I will present various frameworks for unleashing the power of multiple threads on the Xeon Phi, and we will look at interesting properties as well as advantages and disadvantages of each framework.

Possible frameworks

There is a huge number of possible frameworks, language extensions and libraries. Most of them build upon Pthreads, which is the abbreviation for POSIX Threads. Pthreads is the threading API of POSIX-conformant operating systems like Linux and therefore the obvious door to threading on the software layer. Windows is not a POSIX-conformant OS; however, even there we could use Pthreads, which exists as a wrapper around the Windows threads API.

Naturally two choices, which are usually non-exclusive, arise:

  1. Using a framework / library for threading.
  2. Using language or compiler extensions.

The first category provides options such as Intel Threading Building Blocks (also known as TBB), Boost or an MPI implementation. The second category contains OpenMP, Intel Cilk Plus or OpenACC. In this article we will discuss the following options:

  • Intel TBB
  • Intel Cilk Plus
  • OpenMP
  • C++11 Threads (similar to Boost)
  • Using Pthreads directly

We will not discuss MPI, since MPI is not directly a threading framework. Even though MPI also works with multiple threads, it incurs substantial overhead compared to the threading frameworks discussed here and should therefore not be the primary choice for multi-threading. OpenACC, on the other hand, is omitted from the discussion since Intel does not have any plans to support it. While it might work just fine, there are no benefits to using it (as compared to using compiler offload directives with some specialized threading framework).

In the next sections we will present each threading framework. The presentation includes a little bit of background knowledge, a quite simple program and some of the most important properties of the framework. The focus will lie on usability, performance and flexibility.

Intel Threading Building Blocks

Intel TBB is a C++ template library for developing software that takes advantage of multi-core processors. The library contains various data structures and algorithms that allow us to avoid some of the complications of using lower-level APIs: creating, synchronizing and terminating threads manually is no longer required.

Instead, Intel TBB abstracts access to the multiple processors by allowing the desired operations to be treated as "tasks", which are then allocated to individual cores dynamically. An important factor here is the efficient use of the CPU cache. A TBB program creates, synchronizes and destroys graphs of dependent tasks according to the implemented algorithms. The tasks are then executed with respect to their graph dependencies, which decouples the programming from the particulars of the underlying machine.

Now we will actually have a look at some code. We are using a really simple example program that computes a sum depending on an external parameter. We do not care about the fact that the program is actually much faster in sequential execution, since we are only interested in some of the details of the framework.

#include <stdio.h>
#include <stdlib.h>
// The corresponding TBB header
#include "tbb/tbb.h"

// All functions and objects have been defined in the namespace tbb
using namespace tbb;

int main(int argc, char **argv)
{
	int add = argc > 1 ? atoi(argv[1]) : 1;

	// Computing a sum in parallel fits perfectly to the tbb::parallel_reduce function
	int sum = parallel_reduce(
		blocked_range<int>(0, 100),
		0,
		[&add](const blocked_range<int>& r, int local)->int
		{
			for(int i = r.begin(); i != r.end(); i++)
			{
				local += i + add;
			}

			return local;
		},
		[](int x, int y)->int
		{
			return x + y;
		});

	printf("Result = %dn", sum);
	return 0;
}

Here we are using a C++11 lambda expression to pass the function to be invoked to parallel_reduce. This yields the advantage that we do not need to create a whole class and overload the function call operator.
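For comparison, here is a minimal sketch of what the lambda-free alternative could look like: the imperative form of parallel_reduce accepts a hand-written body class with a splitting constructor and a join method. The class name SumBody is our own invention, and the sketch relies on the using namespace tbb directive from the example above.

// A sketch of the pre-C++11 style: a body class for the imperative
// form of tbb::parallel_reduce (the name SumBody is ours)
struct SumBody
{
	int add;   // the external parameter
	int local; // the partial sum accumulated by this body

	SumBody(int a) : add(a), local(0) { }
	// Splitting constructor: invoked when TBB divides the range
	SumBody(SumBody& other, split) : add(other.add), local(0) { }

	// Equivalent to the first lambda expression
	void operator()(const blocked_range<int>& r)
	{
		for(int i = r.begin(); i != r.end(); i++)
		{
			local += i + add;
		}
	}

	// Equivalent to the second lambda expression
	void join(SumBody& rhs) { local += rhs.local; }
};

Usage would then be SumBody body(add); parallel_reduce(blocked_range<int>(0, 100), body); with the result available in body.local afterwards.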

TBB requires a special range object of type blocked_range, which will be used for partitioning the loop. In TBB we have to write the inner part of the loop ourselves, receiving the partitioned sub-range as an argument. This seems like additional work initially; however, we will eventually see that it actually gives us more flexibility and performance.

A special feature of this reduction function is the local accumulator. We initialize it with the value 0, and it is passed to the function that will be called by the partitioner. We then return an updated value, which will be stored by the partitioner. Finally the partitioner calls the last lambda expression, an anonymous function that just sums up the partial results. Note that no synchronization is going on here, which is fine since TBB performs this part in sequential order.

We can already see the strength and weakness of Intel TBB: on the one hand we have direct access to a lot of important parts of a parallel algorithm; on the other hand the resource management (in this case the overall partitioning, thread creation and management) is completely hidden.

If we want to run this piece of code we additionally have to set up our environment for the external dependency, in this case Intel TBB. This can be done with the following command line statements:

SINK_LD_LIBRARY_PATH=$SINK_LD_LIBRARY_PATH:$TBBROOT/lib/mic
export SINK_LD_LIBRARY_PATH=$SINK_LD_LIBRARY_PATH

(Note that $TBBROOT is probably set when loading the MIC software stack, otherwise you need to find out what the correct path for TBB is and set it.)

It is important to link not only against libtbb but also against libpthread, since Intel TBB internally uses Pthreads. The following should therefore work just fine (assuming that $TBBROOT is set as specified above):

icpc -Wall -mmic -o hello-tbb hello-tbb.cc -L$TBBROOT/lib/mic -ltbb -lpthread

Running the code should give the result 5050: when no argument is given, the add variable is set to 1, so the code computes the "little Gauss" sum (the Gauss sum formula) of all numbers from 1 to 100, i.e. 100 * 101 / 2 = 5050.

Intel Cilk Plus

Intel Cilk Plus is an extension to the C/C++ programming language designed for multi-threading. The original simplifications were proposed by Cilk Arts for the Cilk++ extension to C/C++. They eliminated the need for several of the original Cilk keywords while adding the ability to spawn functions and to deal with variables involved in reduction operations.

Additionally, Intel Cilk Plus differs from Cilk/Cilk++ by adding array extensions and compatibility with existing debuggers. Even though Intel has stated its desire to refine Cilk Plus and enable implementations by other compilers, the language extension is only implemented in GCC (in addition to the Intel compiler). However, this implementation is currently maintained by Intel alone and has not yet been included in any officially released version. One of the advantages of using Cilk Plus is the guaranteed maximum memory usage scaling. (Note: apparently a special Clang/LLVM version exists as well, which is publicly available on GitHub.)

Again we are using the same algorithm as before. Cilk Plus does not contain a direct reduction method like Intel TBB does; however, it has a more general way of expressing loops and comes with additional definitions of so-called reduction types.

#include <stdio.h>
#include <stdlib.h>
// This optional header enables cilk_for instead of _Cilk_for etc.
#include <cilk/cilk.h>
// This header gives us the reduction types
#include <cilk/reducer_opadd.h>

//The reduction types use a namespace called cilk
using namespace cilk;

int main(int argc, char **argv)
{
	int add = argc > 1 ? atoi(argv[1]) : 1;
	// Create such a reduction type instance
	reducer_opadd<int> sum;

	// _Cilk_for automatically does everything for us
	_Cilk_for(int i = 0; i < 100; i++)
	{
		//The reduction type takes care of race conditions
		sum += i + add;
	}

	// To get the value we have to call the get_value function
	printf("Result = %d\n", sum.get_value());
	return 0;
}

This is quite a huge code reduction compared to Intel TBB. Again the explicit resource management is hidden, although only in the case of the _Cilk_for instruction. Cilk Plus has three keywords (in short notation):

  • cilk_spawn for creating a new thread with the given function.
  • cilk_sync for joining all threads that have been spawned at the current level.
  • cilk_for for partitioning some work in a loop.

Obviously the extension has been trimmed down to these essentials: spawning tasks and parallelizing loops. Additionally Cilk Plus has very good vector extensions, which make it even more interesting. However, we will not look at those in this tutorial - but in the next one.
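Since our example only exercises the loop keyword, here is a minimal sketch of the other two, using the classic recursive Fibonacci function (this sketch is ours, not part of the original example):

// A sketch of cilk_spawn / cilk_sync with recursive Fibonacci
#include <cilk/cilk.h>

int fib(int n)
{
	if (n < 2)
		return n;

	// The spawned call may run in parallel with the rest of this function
	int x = cilk_spawn fib(n - 1);
	// This call executes in the current strand in the meantime
	int y = fib(n - 2);
	// Wait for the spawned call to finish before using x
	cilk_sync;
	return x + y;
}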

Compiling a program that only uses Cilk Plus is trivial. We do not need to specify any additional external dependencies or libraries. The compiler (in case of the Intel Compiler) already comes with everything that is needed to compile a program that has been decorated with Cilk Plus instructions.

OpenMP

OpenMP is an API that supports multi-platform shared-memory multiprocessing programming in C, C++, and Fortran. In short, OpenMP is an international standard for that API, which has to be implemented by the compiler. It is therefore quite similar to MPI, which is also an international standard that has to be implemented before it can be used in practice. In OpenMP, multi-threading works by having a master thread fork a specified number of slave threads, among which a task is divided.

The forked threads then run concurrently, with the runtime environment allocating them to different processors. The section of code that is meant to run in parallel is marked with a preprocessor directive that will cause the compiler to insert the required instructions to form threads before the section is executed. After the execution of the parallelized code, the threads join back into the master thread, which then continues in sequential mode again.

By default, each thread executes the parallelized section of code independently. Work-sharing constructs can be used to divide a task among the threads so that each thread executes its allocated part of the code. Both task parallelism and data parallelism can be achieved using OpenMP since version 3 of this standard.

Now it's time to look again at some code. This time we will see that the code looks almost like a sequential version of the algorithm. The only difference is the inserted pragma directive.

#include <stdio.h>
#include <stdlib.h>
// The optional header to get access to special omp_ functions 
#include <omp.h>

int main(int argc, char **argv)
{
	int add = argc > 1 ? atoi(argv[1]) : 1;
	int sum = 0;

#pragma omp parallel for reduction(+: sum)
	for(int i = 0; i < 100; i++)
	{
		sum += i + add;
	}

	printf("Result = %dn", sum);
	return 0;
}

Any OpenMP pragma starts with #pragma omp. The instruction parallel tells OpenMP to extract the following code-block as a function that will be called from multiple threads. Here we are also using the keywords for and reduction. The first one results in splitting the work inside the next for-loop between multiple threads (otherwise every thread would do the same work, i.e. the full for-loop). The latter gives OpenMP the hint that we are actually using a special kind of reduction - an addition on the global variable sum.

The beauty of OpenMP lies in three things:

  1. The decorations are non-obtrusive, i.e. without OpenMP we would just have a sequential program.
  2. The exact resource management is again hidden, but we can control / influence it.
  3. It is very easy to alter existing sequential programs by just decorating the critical parts.

So we have a highly portable standard that works across many languages (e.g. C/C++, Fortran) and provides many options as well.
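As an illustration of the second point, the pragma from our example could be extended with clauses that influence the otherwise hidden resource management. The following is a sketch; the values 4 and 10 are arbitrary choices for illustration:

	// Request four threads and a dynamic schedule handing out chunks
	// of ten iterations at a time
#pragma omp parallel for reduction(+: sum) num_threads(4) schedule(dynamic, 10)
	for(int i = 0; i < 100; i++)
	{
		sum += i + add;
	}

The functions from <omp.h>, such as omp_get_max_threads or omp_set_num_threads, offer similar control at runtime.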

C++11 Threads

While the C++11 language provides a memory model that supports threading, the primary support for actually using threading comes with the C++11 standard library. This library provides everything from thread management to mutexes and synchronization. For high performance (low-level work), however, it is sometimes necessary to communicate between threads without the overhead of mutexes. This is achieved using atomic operations on memory locations.

These atomic operations can optionally specify the minimum memory visibility constraints required for an operation. Explicit memory barriers may also be used for this purpose. The new std::async facility provides a convenient method of running tasks and tying them to a std::future, which follows the promise pattern.

The user can choose whether the task is to be run asynchronously on a separate thread or synchronously on a thread that waits for the value. By default, the implementation can choose, which provides an easy way to take advantage of hardware concurrency without oversubscription, and provides some of the advantages of a thread pool for simple usages.
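To make this concrete, here is a minimal sketch (ours, using the same toy problem as before) that computes the sum with std::async, leaving the launch policy to the implementation:

#include <stdio.h>
// std::async and std::future live in this header
#include <future>
#include <vector>

int main()
{
	int add = 1;
	std::vector<std::future<int>> parts;

	// Split the range 0..99 into five chunks; each becomes a task
	for (int start = 0; start < 100; start += 20)
	{
		parts.push_back(std::async([add, start]()-> int
		{
			int local = 0;

			for (int i = start; i < start + 20; ++i)
			{
				local += i + add;
			}

			return local;
		}));
	}

	// get() blocks until the corresponding task has finished
	int sum = 0;

	for (auto& part : parts)
	{
		sum += part.get();
	}

	printf("Result = %d\n", sum);
	return 0;
}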

Out of the box we do not have any parallel loop algorithms such as _Cilk_for or #pragma omp parallel for. Therefore we have to create our own implementation. Our algorithm uses a straightforward implementation that does not consider work-load balance or efficiency. We only care about the work being done and about some features of the threading library in general.

#include <stdio.h>
#include <stdlib.h>
// This header includes the definition for the C++11 thread
#include <thread>
// This header includes the atomic types such as atomic_int
#include <atomic>

// As with all STL contents everything is contained in std
using namespace std;

int main(int argc, char **argv)
{
	int add = argc > 1 ? atoi(argv[1]) : 1;
	// Initialize the atomic sum (direct initialization; atomic types
	// cannot be copy-initialized from an int in C++11)
	atomic_int sum(0);
	const int num_threads = 5;
	int step = 100 / num_threads;
	// Array with threads (num_threads must be a constant expression)
	thread threads[num_threads];

	for(int i = 0, k = 0; i < 100; i+=step)
	{
		// Create the threads (manually) with an anonymous function as entry point
		threads[k++] = thread([&sum, &add](int start, int end)-> void
		{
			for (int i = start; i < end; ++i)
			{
				sum += i + add;
			}
		}, i, i + step);
	}

	// We should join the threads again (wait for their finish)
	for (int i = 0; i < num_threads; ++i)
	{
		threads[i].join();
	}

	printf("Result = %dn", sum.load());
	return 0;
}

This is a big contrast to the two previously discussed frameworks: we need to do all the work ourselves. However, the code is still quite small, since we can make use of C++11 lambda expressions capturing local variables (sum and add are captured by reference). The big advantage of the C++11 threading library is that it is very similar to the threading part of the Boost library, while making use of new C++11 features such as lambda expressions and variadic templates.

Pthreads

Pthreads defines a set of C programming language types, functions and constants. It is implemented with a pthread.h header and a thread library. In the world of High Performance Computing, the primary motivation for using Pthreads is to realize potential program performance gains. When compared to the cost of creating and managing a process, a thread can be created with much less operating system overhead.

Managing threads requires fewer system resources than managing processes. The subroutines which comprise the Pthreads API can be informally grouped into four major groups: Thread management, Mutexes, Condition variables and Synchronization. All identifiers in the threads library begin with pthread_.
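As a small taste of the mutex group, here is a sketch (ours) of a counter protected against concurrent increments; our upcoming example will avoid the mutex entirely by giving each thread its own local sum:

#include <pthread.h>

// A shared counter protected by a statically initialized mutex
static int counter = 0;
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

void increment(void)
{
	// Only one thread at a time may pass this point
	pthread_mutex_lock(&counter_lock);
	counter++;
	pthread_mutex_unlock(&counter_lock);
}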

The main advantage of using Pthreads compared to the previously presented solutions is that all the other solutions are (usually) internally built upon Pthreads. One might therefore gain performance by writing algorithms that are even more specialized (for the architecture or the code), which obviously requires a lot of work in general.

Similar to the previously discussed framework, we do not get any advanced threading algorithms. This means we have to split the work of our for-loop manually again. A big difference, however, is the treatment of arguments: Pthreads passes a very general pointer type, which has to be cast to something meaningful first. This is a big drawback, since nothing prevents us from doing something very silly here.

Transporting the various arguments is therefore done by creating a custom structure, which will (in this case) also be used for local variables. The advantage of this pattern is that we can access the local variable (here a local sum) later on.

#include <stdio.h>
#include <stdlib.h>
// The header with the definitions of all pthread functions
#include <pthread.h>

// Our argument structure
typedef struct
{
	int start;
	int end;
	int add;
	// The local result; initialized in main (an in-struct initializer
	// would not be valid C)
	int local;
} args_t;

// The thread entry point function
void* Worker(void* arguments)
{
	args_t* args = (args_t*)arguments;

	for (int i = args->start; i < args->end; ++i)
	{
		args->local += i + args->add;
	}

	// Terminates the calling thread (explicitly)
	pthread_exit(NULL);
}

int main(int argc, char **argv)
{
	int add = argc > 1 ? atoi(argv[1]) : 1;
	int sum = 0;
	const int num_threads = 5;
	int step = 100 / num_threads;
	// Array with threads
	pthread_t threads[num_threads];
	// Array with arguments for each thread
	args_t arguments[num_threads];

	for(int i = 0, k = 0; i < 100; i+=step, k++)
	{
		// Setting the specific argument
		arguments[k].start = i;
		arguments[k].end = i + step;
		arguments[k].add = add;
		arguments[k].local = 0;
		// Creating the thread
		pthread_create(&threads[k], NULL, Worker, &arguments[k]);
	}

	// We should join the threads again (wait for their finish)
	for (int i = 0; i < num_threads; ++i)
	{
		pthread_join(threads[i], NULL);
		// This can also be used to sum up the local results
		sum += arguments[i].local;
	}

	printf("Result = %dn", sum);
	return 0;
}

Doing multi-threading with Pthreads is certainly the method closest to the hardware, since it is the official API on Linux. However, we have seen that C++11 threads offer the same amount of flexibility along with a more type-safe and elegant syntax. Additionally several other helpers are included, which makes C++11 threads even more attractive than plain Pthreads. Nevertheless the big advantage of Pthreads is its independence from C++11, i.e. from compiler issues that might arise. The new C++11 standard along with its extended and improved STL is not yet fully implemented anywhere, which leaves room for holes or bugs.

A quick assembly comparison

Compiling those sources with the -S option gives us the generated assembly code, and here we already see huge differences. While the sources using C++11 threads and Intel TBB compile to around 25000 lines of assembly code, the sources using OpenMP and Pthreads compile to around 1500 lines. The Intel Cilk Plus example reaches 4000 lines. All those numbers have been gathered using the optimization option -O3.

The question now is: What is in there?

The cleanest code is produced by OpenMP. There are several reasons for this. A good indicator is that the C++ code already looked quite similar to the sequential version. Additionally, all calls inserted by OpenMP stay external, i.e. their bodies do not end up in the generated assembly. That being said, we find the following calls in the source:

  • __kmpc_begin and __kmpc_end
  • __kmpc_global_thread_num
  • __kmpc_ok_to_fork and __kmpc_fork_call
  • __kmpc_serialized_parallel and __kmpc_end_serialized_parallel
  • __kmpc_reduce_nowait and __kmpc_atomic_fixed4_add
  • __kmpc_for_static_init_4 and __kmpc_for_static_fini

All those function calls lead to an external resource. This is quite the opposite with Intel TBB and C++11. Here only a few external resources are called, which is why the generated source is quite big. One factor for this is that both make heavy use of templates, which end up instantiating new functions / classes. This is also the reason why we find a lot of strange (mangled) function names inside the source.

In the C++11 source we also find symbols for try/catch handling, such as __cxa_begin_catch, __cxa_end_catch or __cxa_rethrow. This means that exceptions from one thread can be prevented from bubbling up. Besides that, we also see calls to functions like pthread_cancel or pthread_equal.

This means that under the hood C++11 clearly uses Pthreads. The Pthreads version in turn calls (external) functions like pthread_create or pthread_join, which are the ones we used explicitly in the Pthreads example program's source code.

Finally we can also have a look at Intel Cilk Plus. This looks very similar to OpenMP; however, the reduction is solved differently. Instead of several function calls we used a generic reducer type. All functions and definitions from this type (reducer_opadd<int> in our case) have been inlined into the assembly code. Therefore we only see the very basic function calls:

  • __cilkrts_hyper_create and __cilkrts_hyper_destroy
  • __cilkrts_hyper_lookup
  • __cilkrts_cilk_for_32

Either way the real magic of spawning the threads is hidden, since this is the job of the operating system, which then also manages the threads. There is no easy way around this at user-space, where our programs are running.

Summary

Each option gives us a few advantages and disadvantages. In the end we have to know what we need, i.e. what our requirements are exactly. This is the dominant factor for choosing the right multi-threading framework. If we need total control over what is happening, then C++11 or Pthreads might be the best choices. Here C++11 would drop out instantly if we can't rely on having a modern C++ compiler or if we need C or Fortran support.

On the other side, if we want implicit resource management, then Intel TBB might be the right choice. A good cross-platform solution with less overhead but fewer integrated possibilities is OpenMP. A choice that sits in the middle is Intel Cilk Plus: here we only get a minimal set of features, but those features are packed into beautiful language extensions which scale quite well.

The following table gives a short overview of pros and cons of the possibilities we looked at.

OpenMP          Intel Cilk Plus    Intel TBB          C++11 Threads      Pthreads
+ Options       + Performance      + Possibilities    + Flexibility      + Flexibility
+ Portable      + Scaling          + Management       + Type-Safety      + Low-Level
+ Languages     + Variables        + Maintenance      + Possibilities    + Compatibility
- Performance   - Fortran          - Language         - Fortran          - Efforts
- Memory        - Control          - OOP              - Compiler         - Type-Safety
- Unreliable    - Possibilities    - Control          - Scaling          - Management

We did not go into specific performance evaluations, since in the end this is strongly application dependent. For instance OpenMP provides quite fast fork and join functions, which enables applications to stay sequential most of the time and only fork in time-critical sections. On the other hand, the very low-level functions provided by Pthreads suit applications that do mostly independent work but need very custom and flexible synchronization points.

All in all one should first look at the programming model that fits best. Afterwards a performance evaluation should be done. If the results are sufficient and the framework provides all options one requires, then it seems like a good tool for the job. Otherwise another framework should be chosen.

A final thing to note is that most frameworks are not exclusive, i.e. we could combine the advantages of OpenMP with more flexible barriers that make use of C++11 atomics or similar.
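As a sketch of such a combination (ours, for illustration only), an OpenMP loop can update a C++11 atomic directly instead of using the reduction clause - slower for this toy problem, but it shows that the two coexist:

#include <stdio.h>
#include <atomic>

int main()
{
	int add = 1;
	// The atomic replaces OpenMP's reduction clause here
	std::atomic<int> sum(0);

#pragma omp parallel for
	for (int i = 0; i < 100; i++)
	{
		sum += i + add;
	}

	printf("Result = %d\n", sum.load());
	return 0;
}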

Conclusion

In this article we had a closer look at possible frameworks for managing multi-threading efficiently. We have seen that there is actually a variety of options. Since the final software will depend on the chosen framework, choosing the right one for the desired purpose will have a huge impact on the software design and performance.

In the next article we will walk through the very important aspect of vectorization. We will see that mastering the vector unit of the Xeon Phi is crucial for high performance.

