Modern scientific computing typically leverages GPU-powered parallel processing cores to speed up large-scale applications. This chapter discusses how to implement heterogeneous decomposition algorithms using CPU-GPU collaborative tasking.
Taskflow enables concurrent CPU-GPU tasking by leveraging CUDA Graph. The tasking interface is referred to as cudaFlow. A cudaFlow is a graph object of type tf::cudaFlow created at runtime, similar to dynamic tasking. It manages a task node in a taskflow and associates it with a CUDA graph. To create a cudaFlow, emplace a callable that takes a reference to a tf::cudaFlow object. The following example implements the canonical saxpy (single-precision A·X Plus Y) task graph.
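Below is a minimal sketch of such a program; the kernel body, the problem size N, and the task names are illustrative choices rather than part of the cudaFlow interface.

#include <taskflow/taskflow.hpp>
#include <vector>

// saxpy kernel: y = a*x + y
__global__ void saxpy(int n, float a, float* x, float* y) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    y[i] = a * x[i] + y[i];
  }
}

int main() {

  tf::Taskflow taskflow;
  tf::Executor executor;

  const unsigned N = 1 << 20;                    // number of elements
  std::vector<float> hx(N, 1.0f), hy(N, 2.0f);   // host vectors
  float *dx{nullptr}, *dy{nullptr};              // device pointers

  // CPU tasks: allocate device memory for dx and dy
  tf::Task allocate_x = taskflow.emplace([&](){ cudaMalloc(&dx, N*sizeof(float)); });
  tf::Task allocate_y = taskflow.emplace([&](){ cudaMalloc(&dy, N*sizeof(float)); });

  // GPU task: one cudaFlow of two H2D copies, one kernel, and two D2H copies
  tf::Task cudaflow = taskflow.emplace([&](tf::cudaFlow& cf) {
    tf::cudaTask h2d_x  = cf.copy(dx, hx.data(), N);
    tf::cudaTask h2d_y  = cf.copy(dy, hy.data(), N);
    tf::cudaTask kernel = cf.kernel((N+255)/256, 256, 0, saxpy, N, 2.0f, dx, dy);
    tf::cudaTask d2h_x  = cf.copy(hx.data(), dx, N);
    tf::cudaTask d2h_y  = cf.copy(hy.data(), dy, N);
    h2d_x.precede(kernel);
    h2d_y.precede(kernel);
    kernel.precede(d2h_x, d2h_y);
  });

  allocate_x.precede(cudaflow);
  allocate_y.precede(cudaflow);

  executor.run(taskflow).wait();
  return 0;
}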
Debrief:
- hx and hy are the host (CPU) vectors holding the input data
- dx and dy are the pointers to the corresponding memory blocks on the device (GPU)
- the two allocation tasks reserve memory for dx and dy on the device, each of N*sizeof(float) bytes

Taskflow does not expend unnecessary effort on kernel programming but focuses on tasking CUDA operations together with CPU work. We give users full privilege to craft a CUDA kernel that is commensurate with their domain knowledge. Users focus on developing high-performance kernels using the native CUDA toolkit, while leaving the difficult task parallelism to Taskflow.
Use nvcc (at least v10) to compile a cudaFlow program:
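A typical invocation looks like the following; the source file name, include path, and output name are placeholders for your own project.

nvcc -std=c++14 -I path/to/taskflow saxpy.cu -O2 -o saxpy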
Our source automatically enables cudaFlow when it detects a CUDA compiler.
By default, the executor spawns one worker per GPU. We dedicate a worker set to each heterogeneous domain, for example, the host domain and the CUDA domain. If your system has 4 CPU cores and 2 GPUs, the default number of workers spawned by the executor is 4+2, where 4 workers run CPU tasks and 2 workers run GPU tasks (cudaFlows). You can construct an executor with a different number of GPU workers.
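The snippet below is a sketch of such a construction, assuming the constructor takes the number of CPU workers followed by the number of GPU workers:

tf::Executor executor(17, 8);  // 17 workers for CPU tasks, 8 workers for GPU tasks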
The above executor spawns 17 and 8 workers for running CPU and GPU tasks, respectively. These workers coordinate with each other to balance the load in a work-stealing loop highly optimized for performance.
You can run a cudaFlow on multiple GPUs by explicitly associating a cudaFlow or a kernel task with a CUDA device. A CUDA device is an integer in the range [0, N) identifying a GPU, where N is the number of GPUs in the system. The code below creates a cudaFlow that runs on GPU 2 through my_stream.
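A sketch of this placement is shown below; it assumes the stream-based run interface (tf::cudaScopedDevice, tf::cudaStream, and tf::cudaFlow::run), and dx and dy are assumed to be unified memory blocks so that they are valid on GPU 2.

{
  tf::cudaScopedDevice device(2);   // switch to GPU 2 within this scope
  tf::cudaStream my_stream;         // stream created under the GPU 2 context
  tf::cudaFlow cf;
  cf.kernel((N+255)/256, 256, 0, saxpy, N, 2.0f, dx, dy);
  cf.run(my_stream);                // offload the CUDA graph through my_stream
  my_stream.synchronize();          // wait for the run to complete
}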
You can place a kernel on a device explicitly through the method tf::cudaFlow::kernel_on, which takes the device identifier as its first argument.
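For instance, the sketch below places the saxpy kernel from the earlier example on GPU 1; the arguments after the device identifier follow the same convention as tf::cudaFlow::kernel, and dx and dy are assumed to be unified memory blocks.

taskflow.emplace([&](tf::cudaFlow& cf) {
  // device id, grid, block, shared memory, kernel, kernel arguments
  tf::cudaTask task = cf.kernel_on(1, (N+255)/256, 256, 0, saxpy, N, 2.0f, dx, dy);
});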
Debrief:
- dx and dy are unified memory blocks accessible from the CPU and from kernels on any GPU
- z1 and z2 hold the results produced by the two kernels
- the CPU verifies z1 and z2 and records the maximum error in max_error (which should be zero)

Running the program gives the following nvidia-smi snapshot in a system of 4 GPUs:
Even though cudaFlow provides an interface for device placement, it is your responsibility to ensure correct memory access. For example, you may not allocate a memory block on GPU 2 using cudaMalloc and then access it from a kernel running on GPU 1. A safe practice is to allocate unified memory blocks using cudaMallocManaged and let the CUDA runtime perform automatic memory migration between processors (as demonstrated in the code example above).
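For reference, allocating unified memory blocks for dx and dy boils down to standard CUDA runtime calls:

float *dx{nullptr}, *dy{nullptr};
// unified memory: migrated on demand between the CPU and any GPU
cudaMallocManaged(&dx, N*sizeof(float));
cudaMallocManaged(&dy, N*sizeof(float));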
In the same example, you may create two cudaFlows for the two kernels on the two GPUs, respectively. The overhead of creating a kernel on the same device as its cudaFlow is much lower than creating it on a different device.
cudaFlow provides a set of methods for users to manipulate device memory data. There are two categories: raw data and typed data. Raw data operations are methods with the prefix mem, such as memcpy and memset, that operate on a device memory area in bytes. Typed data operations, such as copy, fill, and zero, take a logical count of elements. For instance, the following three methods have the same result of zeroing sizeof(int)*count bytes of the device memory area pointed to by target.
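A sketch of the three equivalent calls, assuming target points to count integers on the device:

taskflow.emplace([&](tf::cudaFlow& cf) {
  cf.memset(target, 0, sizeof(int) * count);  // raw: zero a byte range
  cf.fill(target, 0, count);                  // typed: fill count ints with the value 0
  cf.zero(target, count);                     // typed: zero count ints
});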
The method cudaFlow::fill is a more powerful version of cudaFlow::memset. It can fill a memory area with any value of type T
, given that sizeof(T)
is 1, 2, or 4 bytes. For example, the following code sets each element in the array target
to 1234.
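A sketch of this call, assuming target points to count elements of type int:

taskflow.emplace([&](tf::cudaFlow& cf) {
  cf.fill(target, 1234, count);  // set each of the count ints to 1234
});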
A similar concept applies to cudaFlow::memcpy and cudaFlow::copy as well.
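For instance, the two calls below sketch equivalent transfers of count floats from src to dst (the pointer names are placeholders):

taskflow.emplace([&](tf::cudaFlow& cf) {
  cf.memcpy(dst, src, count * sizeof(float));  // raw: copy a byte range
  cf.copy(dst, src, count);                    // typed: copy count floats
});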
You can create a cudaFlow once and launch it multiple times using cudaFlow::repeat or cudaFlow::predicate, given that the graph parameters remain unchanged across all iterations. The executor iterates the execution of the cudaFlow until the predicate evaluates to true.
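The sketch below runs the saxpy graph from the first example 100 times; the two forms are assumed to be equivalent, with repeat taking an iteration count and predicate taking a callable that returns true when the iteration should stop.

// fixed number of repetitions
taskflow.emplace([&](tf::cudaFlow& cf) {
  cf.kernel((N+255)/256, 256, 0, saxpy, N, 2.0f, dx, dy);
  cf.repeat(100);   // launch the underlying CUDA graph 100 times
});

// predicate-controlled repetitions
int remaining = 100;
taskflow.emplace([&](tf::cudaFlow& cf) {
  cf.kernel((N+255)/256, 256, 0, saxpy, N, 2.0f, dx, dy);
  cf.predicate([&]() { return remaining-- == 0; });  // stop when the predicate is true
});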
Creating a cudaFlow incurs a certain overhead, which means fine-grained tasking such as one GPU operation per cudaFlow may not give you any performance gain. You should aggregate as many GPU operations as possible into a cudaFlow to launch the entire graph once, instead of issuing separate calls. For example, the following code creates the saxpy task graph at a very fine-grained level using one cudaFlow per GPU operation.
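A sketch of this fine-grained version, reusing the saxpy kernel and data from the first example:

tf::Task h2d_x = taskflow.emplace([&](tf::cudaFlow& cf) { cf.copy(dx, hx.data(), N); });
tf::Task h2d_y = taskflow.emplace([&](tf::cudaFlow& cf) { cf.copy(dy, hy.data(), N); });
tf::Task kernel = taskflow.emplace([&](tf::cudaFlow& cf) {
  cf.kernel((N+255)/256, 256, 0, saxpy, N, 2.0f, dx, dy);
});
tf::Task d2h_x = taskflow.emplace([&](tf::cudaFlow& cf) { cf.copy(hx.data(), dx, N); });
tf::Task d2h_y = taskflow.emplace([&](tf::cudaFlow& cf) { cf.copy(hy.data(), dy, N); });
kernel.succeed(h2d_x, h2d_y)
      .precede(d2h_x, d2h_y);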
The following code aggregates the five GPU operations using one cudaFlow to deliver much better performance.
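A sketch of the aggregated version, again reusing the saxpy kernel and data:

tf::Task cudaflow = taskflow.emplace([&](tf::cudaFlow& cf) {
  tf::cudaTask h2d_x  = cf.copy(dx, hx.data(), N);
  tf::cudaTask h2d_y  = cf.copy(dy, hy.data(), N);
  tf::cudaTask kernel = cf.kernel((N+255)/256, 256, 0, saxpy, N, 2.0f, dx, dy);
  tf::cudaTask d2h_x  = cf.copy(hx.data(), dx, N);
  tf::cudaTask d2h_y  = cf.copy(hy.data(), dy, N);
  h2d_x.precede(kernel);
  h2d_y.precede(kernel);
  kernel.precede(d2h_x, d2h_y);
});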
We encourage users to study and understand the parallel structure of their applications in order to come up with the best granularity of task decomposition. A refined task graph can show a significant performance difference from its raw counterpart.