Taskflow 2.7.0
C6: CPU-GPU Tasking

Modern scientific computing typically leverages GPU-powered parallel processing cores to speed up large-scale applications. This chapter discusses how to implement heterogeneous decomposition algorithms using CPU-GPU collaborative tasking.

Create a cudaFlow

Taskflow enables concurrent CPU-GPU tasking by leveraging CUDA Graph. The tasking interface is referred to as cudaFlow. A cudaFlow is a graph object of type tf::cudaFlow created at runtime, similar to dynamic tasking. It manages a task node in a taskflow and associates it with a CUDA Graph. To create a cudaFlow, emplace a callable that takes an argument of type tf::cudaFlow. The following example implements the canonical saxpy (single-precision A·X Plus Y) task graph.

1: #include <taskflow/taskflow.hpp>
2:
3: // saxpy (single-precision A·X Plus Y) kernel
4: __global__ void saxpy(int n, float a, float *x, float *y) {
5:   int i = blockIdx.x*blockDim.x + threadIdx.x;
6:   if (i < n) {
7:     y[i] = a*x[i] + y[i];
8:   }
9: }
10:
11: // main function begins
12: int main() {
13:
14:   tf::Taskflow taskflow;
15:   tf::Executor executor;
16:
17:   const unsigned N = 1<<20; // size of the vector
18:
19:   std::vector<float> hx(N, 1.0f); // x vector at host
20:   std::vector<float> hy(N, 2.0f); // y vector at host
21:
22:   float *dx{nullptr}; // x vector at device
23:   float *dy{nullptr}; // y vector at device
24:
25:   tf::Task allocate_x = taskflow.emplace(
26:     [&](){ cudaMalloc(&dx, N*sizeof(float)); }
27:   );
28:
29:   tf::Task allocate_y = taskflow.emplace(
30:     [&](){ cudaMalloc(&dy, N*sizeof(float)); }
31:   );
32:
33:   tf::Task cudaflow = taskflow.emplace([&](tf::cudaFlow& cf) {
34:     // create data transfer tasks
35:     tf::cudaTask h2d_x = cf.copy(dx, hx.data(), N); // host-to-device x data transfer
36:     tf::cudaTask h2d_y = cf.copy(dy, hy.data(), N); // host-to-device y data transfer
37:     tf::cudaTask d2h_x = cf.copy(hx.data(), dx, N); // device-to-host x data transfer
38:     tf::cudaTask d2h_y = cf.copy(hy.data(), dy, N); // device-to-host y data transfer
39:
40:     // launch saxpy<<<(N+255)/256, 256, 0>>>(N, 2.0f, dx, dy)
41:     tf::cudaTask kernel = cf.kernel((N+255)/256, 256, 0, saxpy, N, 2.0f, dx, dy);
42:
43:     kernel.succeed(h2d_x, h2d_y)
44:           .precede(d2h_x, d2h_y);
45:   });
46:   cudaflow.succeed(allocate_x, allocate_y); // overlap data allocations
47:
48:   executor.run(taskflow).wait();
49:
50:   taskflow.dump(std::cout); // dump the taskflow
51: }
(Figure: the resulting saxpy task graph, saxpy.svg)

Debrief:

  • Lines 3-9 define a saxpy kernel using CUDA
  • Lines 19-20 declare two host vectors, hx and hy
  • Lines 22-23 declare two device vector pointers, dx and dy
  • Lines 25-31 declare two tasks to allocate memory for dx and dy on device, each of N*sizeof(float) bytes
  • Lines 33-45 create a cudaFlow to capture kernel work in a graph (two host-to-device data transfer tasks, one saxpy kernel task, and two device-to-host data transfer tasks)
  • Lines 46-48 define the task dependency between host tasks and the cudaFlow tasks and execute the taskflow
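The dump on line 50 writes the taskflow in GraphViz (DOT) format. Assuming GraphViz is installed and using the binary name from the next section, you can render the figure above, for example:

~$ ./my_cudaflow | dot -Tsvg > saxpy.svg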

Taskflow does not expend unnecessary effort on kernel programming but focuses on tasking CUDA operations along with CPU work. We give users full privilege to craft a CUDA kernel that is commensurate with their domain knowledge. Users focus on developing high-performance kernels using the native CUDA toolkit, while leaving the difficult task parallelism to Taskflow.

Compile a cudaFlow Program

Use nvcc (at least v10) to compile a cudaFlow program:

~$ nvcc my_cudaflow.cu -I path/to/include/taskflow -O2 -o my_cudaflow
~$ ./my_cudaflow
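
Depending on your host compiler's defaults, you may also need to pass the C++ standard flag explicitly, since Taskflow requires a C++14 (or newer) compiler:

~$ nvcc -std=c++14 my_cudaflow.cu -I path/to/include/taskflow -O2 -o my_cudaflow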

Our source automatically enables cudaFlow when it detects a CUDA compiler.

Configure the Number of GPU Workers

By default, the executor spawns one worker per GPU. We dedicate a worker set to each heterogeneous domain, for example, the host domain and the CUDA domain. If your system has 4 CPU cores and 2 GPUs, the default number of workers spawned by the executor is 4+2: 4 workers run CPU tasks and 2 workers run GPU tasks (cudaFlows). You can construct an executor with a different number of GPU workers.

tf::Executor executor(17, 8); // 17 CPU workers and 8 GPU workers

The above executor spawns 17 and 8 workers for running CPU and GPU tasks, respectively. These workers coordinate with each other to balance the load in a work-stealing loop highly optimized for performance.
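
For instance, on the 4-core, 2-GPU system described above, the default-constructed executor behaves as if you had spawned the two worker sets explicitly (a sketch; the actual counts depend on your hardware):

tf::Executor executor(4, 2); // 4 CPU workers and 2 GPU workers, the default on that machine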

Run a cudaFlow on Multiple GPUs

You can run a cudaFlow on multiple GPUs by explicitly associating a cudaFlow or a kernel task with a CUDA device. A CUDA device is an integer number in the range of [0, N) representing the identifier of a GPU, where N is the number of GPUs in a system. The code below creates a cudaFlow that runs on GPU 2.

taskflow.emplace([](tf::cudaFlow& cf) {
  cf.device(2);
  // add more cudaTasks below (all tasks are placed on GPU 2 unless specified explicitly)
});
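
Since a valid device identifier must fall in the range [0, N), you can query N through the standard CUDA runtime call cudaGetDeviceCount before placing a cudaFlow (a minimal sketch):

int num_gpus = 0;
cudaGetDeviceCount(&num_gpus); // N: the number of GPUs visible to the program
assert(2 < num_gpus);          // make sure GPU 2 exists before calling cf.device(2)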

You can place a kernel on a device explicitly through the method tf::cudaFlow::kernel_on, which takes the device identifier as the first argument.

1: #include <taskflow/taskflow.hpp>
2:
3: // saxpy (single-precision A·X Plus Y) kernel
4: __global__ void saxpy(int n, int a, int *x, int *y, int *z) {
5:   int i = blockIdx.x*blockDim.x + threadIdx.x;
6:   if (i < n) {
7:     z[i] = a*x[i] + y[i];
8:   }
9: }
10:
11: int main() {
12:
13:   const unsigned N = 1<<20;
14:
15:   int* dx {nullptr};
16:   int* dy {nullptr};
17:   int* z1 {nullptr};
18:   int* z2 {nullptr};
19:
20:   cudaMallocManaged(&dx, N*sizeof(int)); // create a unified memory block for x
21:   cudaMallocManaged(&dy, N*sizeof(int)); // create a unified memory block for y
22:   cudaMallocManaged(&z1, N*sizeof(int)); // result of saxpy task 1
23:   cudaMallocManaged(&z2, N*sizeof(int)); // result of saxpy task 2
24:
25:   for(unsigned i=0; i<N; ++i) {
26:     dx[i] = 1;
27:     dy[i] = 2;
28:   }
29:
30:   tf::Taskflow taskflow;
31:   tf::Executor executor;
32:
33:   taskflow.emplace([&](tf::cudaFlow& cf){
34:     // launch the cudaFlow on GPU 0
35:     cf.device(0);
36:
37:     // launch the first saxpy kernel on GPU 1
38:     cf.kernel_on(1, (N+255)/256, 256, 0, saxpy, N, 2, dx, dy, z1);
39:
40:     // launch the second saxpy kernel on GPU 3
41:     cf.kernel_on(3, (N+255)/256, 256, 0, saxpy, N, 2, dx, dy, z2);
42:   });
43:
44:   executor.run(taskflow).wait();
45:
46:   cudaFree(dx);
47:   cudaFree(dy);
48:
49:   // verify the solution; max_error should be zero
50:   int max_error = 0;
51:   for (size_t i = 0; i < N; i++) {
52:     max_error = std::max(max_error, abs(z1[i]-4));
53:     max_error = std::max(max_error, abs(z2[i]-4));
54:   }
55:   std::cout << "saxpy finished with max error: " << max_error << '\n';
56: }

Debrief:

  • Lines 3-9 define a CUDA saxpy kernel that stores the result to z
  • Lines 15-23 declare four unified memory blocks accessible from any processor
  • Lines 25-28 initialize dx and dy blocks by CPU
  • Lines 33-42 create a cudaFlow task
  • Lines 34-35 associate the cudaFlow on GPU 0
  • Lines 37-38 create a kernel task to launch the first saxpy on GPU 1 and store the result in z1
  • Lines 40-41 create a kernel task to launch the second saxpy on GPU 3 and store the result in z2
  • Lines 44-55 run the taskflow and verify the result (max_error should be zero)

Running the program gives the following nvidia-smi snapshot on a system with 4 GPUs:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.50       Driver Version: 430.50       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:18:00.0 Off |                  N/A |
| 32%   35C    P2    68W / 250W |    163MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:3B:00.0 Off |                  N/A |
| 33%   43C    P2   247W / 250W |    293MiB / 11019MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 208...  Off  | 00000000:86:00.0 Off |                  N/A |
| 32%   37C    P0    72W / 250W |     10MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce RTX 208...  Off  | 00000000:AF:00.0 Off |                  N/A |
| 31%   43C    P2   245W / 250W |    293MiB / 11019MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     53869      C   ./a.out                                      153MiB |
|    1     53869      C   ./a.out                                      155MiB |
|    3     53869      C   ./a.out                                      155MiB |
+-----------------------------------------------------------------------------+

Even though cudaFlow provides an interface for device placement, it is your responsibility to ensure correct memory access. For example, you may not allocate a memory block on GPU 2 using cudaMalloc and access it from a kernel on GPU 1. A safe practice is to allocate unified memory blocks using cudaMallocManaged and let the CUDA runtime perform automatic memory migration between processors (as demonstrated in the code example above).
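
A minimal sketch of the unsafe pattern described above (assuming a system with at least three GPUs):

cudaSetDevice(2);              // subsequent allocations happen on GPU 2
int* p {nullptr};
cudaMalloc(&p, N*sizeof(int)); // p now resides on GPU 2
// a kernel placed on GPU 1 must not dereference p
// unless peer access between the two devices is enabled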

For the same example, you may instead create two cudaFlows to run the two kernels on the two GPUs, respectively. The overhead of creating a kernel on the same device as its cudaFlow is much lower than creating it on a different device.

tf::Task cudaFlow_on_gpu1 = taskflow.emplace([&](tf::cudaFlow& cf){
  cf.device(1);
  cf.kernel((N+255)/256, 256, 0, saxpy, N, 2, dx, dy, z1);
});
tf::Task cudaFlow_on_gpu3 = taskflow.emplace([&](tf::cudaFlow& cf){
  cf.device(3);
  cf.kernel((N+255)/256, 256, 0, saxpy, N, 2, dx, dy, z2);
});

GPU Memory Operations

cudaFlow provides a set of methods for users to manipulate device memory data. There are two categories: raw data and typed data. Raw data operations are methods prefixed with mem, such as memcpy and memset, that operate on a device memory area in bytes. Typed data operations, such as copy, fill, and zero, operate on a logical count of elements. For instance, the following three methods have the same result of zeroing sizeof(int)*count bytes of the device memory area pointed to by target.

int* target;
cudaMalloc(&target, count*sizeof(int));
taskflow.emplace([&](tf::cudaFlow& cf){
  tf::cudaTask memset_target = cf.memset(target, 0, sizeof(int) * count);
  tf::cudaTask same_as_above = cf.fill(target, 0, count);
  tf::cudaTask same_as_above_again = cf.zero(target, count);
});

The method cudaFlow::fill is a more powerful version of cudaFlow::memset. It can fill a memory area with a value of any type T, given that sizeof(T) is 1, 2, or 4 bytes. For example, the following code sets each element in the array target to 1234.

taskflow.emplace([&](tf::cudaFlow& cf){
  cf.fill(target, 1234, count);
});
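
Because fill is typed, the same call works for other 1-, 2-, or 4-byte element types. For instance, a sketch that fills a device array of floats (dtarget is a hypothetical float* pointing to count elements):

taskflow.emplace([&](tf::cudaFlow& cf){
  cf.fill(dtarget, 3.14f, count); // sizeof(float) == 4, so fill accepts it
});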

A similar concept applies to cudaFlow::memcpy and cudaFlow::copy as well.

taskflow.emplace([&](tf::cudaFlow& cf){
  tf::cudaTask memcpy_target = cf.memcpy(target, source, sizeof(int) * count);
  tf::cudaTask same_as_above = cf.copy(target, source, count);
});

Iterate a cudaFlow

You can create a cudaFlow once and launch it multiple times using cudaFlow::repeat or cudaFlow::predicate, given that the graph parameters remain unchanged across all iterations.

taskflow.emplace([&] (tf::cudaFlow& cf) {
  // construct the GPU task dependency graph ...

  // launch the cudaFlow 10 times
  cf.repeat(10);

  // equivalently
  cf.predicate([n=10] () mutable { return n-- == 0; });
});

The executor iterates the execution of the cudaFlow until the predicate evaluates to true.
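
For example, the sketch below (assuming a hypothetical kernel add_one and a device array data of N integers initialized to zero) runs a single-kernel cudaFlow 10 times, leaving every element equal to 10:

__global__ void add_one(int n, int* data) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) {
    data[i] += 1;
  }
}

taskflow.emplace([&](tf::cudaFlow& cf) {
  cf.kernel((N+255)/256, 256, 0, add_one, N, data);
  cf.repeat(10); // the executor launches this graph 10 times
});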

Granularity

Creating a cudaFlow has certain overhead, which means fine-grained tasking, such as one GPU operation per cudaFlow, may not give you any performance gain. You should aggregate as many GPU operations as possible into a single cudaFlow to launch the entire graph once, rather than issuing separate calls. For example, the following code creates the saxpy task graph at a very fine-grained level, using one cudaFlow per GPU operation.

tf::Task h2d_x = taskflow.emplace([&](tf::cudaFlow& cf) {
  cf.copy(dx, hx.data(), N);
});
tf::Task h2d_y = taskflow.emplace([&](tf::cudaFlow& cf) {
  cf.copy(dy, hy.data(), N);
});
tf::Task d2h_x = taskflow.emplace([&](tf::cudaFlow& cf) {
  cf.copy(hx.data(), dx, N);
});
tf::Task d2h_y = taskflow.emplace([&](tf::cudaFlow& cf) {
  cf.copy(hy.data(), dy, N);
});
tf::Task kernel = taskflow.emplace([&](tf::cudaFlow& cf) {
  cf.kernel((N+255)/256, 256, 0, saxpy, N, 2.0f, dx, dy);
});
kernel.succeed(h2d_x, h2d_y)
      .precede(d2h_x, d2h_y);

The following code aggregates the five GPU operations into one cudaFlow to deliver much better performance.

tf::Task cudaflow = taskflow.emplace([&](tf::cudaFlow& cf) {
  tf::cudaTask h2d_x  = cf.copy(dx, hx.data(), N);
  tf::cudaTask h2d_y  = cf.copy(dy, hy.data(), N);
  tf::cudaTask d2h_x  = cf.copy(hx.data(), dx, N);
  tf::cudaTask d2h_y  = cf.copy(hy.data(), dy, N);
  tf::cudaTask kernel = cf.kernel((N+255)/256, 256, 0, saxpy, N, 2.0f, dx, dy);
  kernel.succeed(h2d_x, h2d_y)
        .precede(d2h_x, d2h_y);
});

We encourage users to study and understand the parallel structure of their applications in order to determine the best granularity of task decomposition. A well-refined task graph can make a significant performance difference compared to its raw counterpart.