GPU Tasking (cudaFlowCapturer)
Contents
You can create a cudaFlow through stream capture, which allows you to capture information on GPU activities that are submitted to the stream managed by a cudaFlowCapturer.
Capture a cudaFlow
When your program has no access to direct kernel calls but invoke it through a stream-based interface (e.g., cuBLAS and cuDNN library functions), you can use tf::
The following example creates a CUDA graph that captures two kernel tasks, task_1
and task_2
, where task_1
(i.e., my_kernel_1
) runs before task_2
(i.e., my_kernel_2
).
#include <taskflow/cudaflow.hpp> int main() { tf::Executor executor; tf::Taskflow taskflow; tf::Task task = taskflow.emplace([&](tf::cudaFlowCapturer& capturer){ // capture my_kernel_1 through a stream managed by capturer tf::cudaTask task_1 = capturer.on([&](cudaStream_t stream){ my_kernel_1<<<grid_1, block_1, shm_size_1, stream>>>(my_parameters_1); }).name("my_kernel_1"); // capture my_kernel_2 through a stream managed by capturer tf::cudaTask task_2 = capturer.on([&](cudaStream_t stream){ my_kernel_2<<<grid_2, block_2, shm_size_2, stream>>>(my_parameters_2); }).name("my_kernel_2"); // my_kernel_1 runs before my_kernel_2 task_1.precede(task_2); }).name("capturer"); executor.run(taskflow).wait(); taskflow.dump(std::cout); return 0; }
The stream object passed to each tf::
A cudaFlowCapturer lives with the callable. When the executor invoke the capturer callable, it creates the cudaFlowCapturer and will destroy it until all internal operations finish.
Common Capture Methods
cudaFlowCapturer defines a set of methods for capturing common GPU operations, such as tf::
tf::Task task = taskflow.emplace([](tf::cudaFlowCapturer& capturer){ // copy data from host_data to gpu_data tf::cudaTask h2d = capturer.memcpy(gpu_data, host_data, bytes).name("h2d"); // capture my_kernel to do computation on gpu_data tf::cudaTask kernel = capturer.on([&](cudaStream_t stream){ my_kernel<<<grid, block, shm_size, stream>>>(gpu_data, arg1, arg2, ...); }).name("my_kernel"); // copy data from gpu_data to host_data tf::cudaTask d2h = capturer.memcpy(host_data, gpu_data, bytes).name("d2h"); h2d.precede(kernel); kernel.precede(d2h); }).name("capturer");
Create a Capturer on a Specific GPU
You can capture a cudaFlow on a specific GPU by calling tf::
tf::Task task = taskflow.emplace_on([](tf::cudaFlowCapturer& capturer){ // here, capturer is under GPU device context 2 // ... }, 2);
The above example creates a capturer on GPU 2. When the executor runs the callable, it switches to GPU 2 and all the functions within the callable are called under this context.
Create a Capturer within a cudaFlow
Within a parent cudaFlow, you can capture a cudaFlow to form a subflow that eventually becomes a child node in the underlying CUDA task graph. The following example defines a captured flow task2
of two dependent tasks, task2_1
and task2_2
, and task2
runs after task1
.
tf::Task task = taskflow.emplace([&](tf::cudaFlow& cf){ tf::cudaTask task1 = cf.kernel(grid, block, shm, my_kernel, args...) .name("my_kernel"); // task2 forms a subflow in cf and becomes a child node in the underlying // CUDA graph tf::cudaTask task2 = cf.capture([&](tf::cudaFlowCapturer& capturer){ // capture my_kernel1 using the given stream tf::cudaTask task2_1 = capturer.on([&](cudaStream_t stream){ my_kernel2<<<grid1, block1, shm_size1, stream>>>(args1...); }).name("my_kernel1"); // capture my_kernel2 using the given stream tf::cudaTask task2_2 = capturer.on([&](cudaStream_t stream){ my_kernel2<<<grid2, block2, shm_size2, stream>>>(args2...); }).name("my_kernel2"); task2_1.precede(task2_2); }).name("capturer"); task1.precede(task2); }).name("cudaFlow");
Create a Custom Capturer
By inheriting tf::
class MyCapturer : public tf::cudaFlowCapturerBase { public: // capture a custom kernel tf::cudaTask capture_custom_kernel(args...) { return factory()->on([this, args...](cudaStream_t stream){ invoke_custom_kernel(stream, args...); }); } }; tf::Task task = taskflow.emplace([&](tf::cudaFlowCapturer& capturer){ // create a custom capturer, MyCapturer, from the parent capturer (factory) MyCapturer* mc = capturer.make_capturer<MyCapturer>(); tf::cudaTask task_1 = mc->capture_custom_kernel(args...).name("custom_kernel"); tf::cudaTask task_2 = capturer.on([&](cudaStream_t stream){ other_kernel<<<grid, block, shm_size, stream>>>(other_args...); }).name("other_kernel"); task_1.precede(task_2); }).name("capturer");
Taskflow provides several class interfaces, such as tf::
Offload and Update a cudaFlow Capturer
Similar to Offload and Update a cudaFlow, you can offload a cudaFlow capturer explicitly and then update the graph for iterative executions. When you offload a cudaFlow capturer, an executable graph will be created. The system runtime transforms the user-described graph into an executable graph optimized for maximum stream concurrency. Depending on the optimization, the user-described graph may be different from the actual executable graph. The following example captures a kernel task and runs it twice using tf::
1: taskflow.emplace([&] (tf::cudaFlowCapturer& cf) { 2: tf::cudaTask task = cf.kernel(grid1, block1, shm1, my_kernel, args1...); 3: cf.offload(); 4: cf.offload(); 5: });
We currently supports the following offload methods:
- tf::
cudaFlowCapturer:: offload offloads and runs the cudaFlow capturer once - tf::
cudaFlowCapturer:: offload_n offloads and runs the cudaFlow capturer n
times - tf::
cudaFlowCapturer:: offload_until offloads and repeatedly runs the cudaFlow capturer until the given predicate returns true
taskflow.emplace(tf::cudaFlowCapturer& cf) { // ... create capture tasks cf.offload(); // offload the cudaFlow capturer and run it once cf.offload_n(10); // offload the cudaFlow capturer and run it 10 times cf.offload_until([repeat=5] () mutable { return repeat-- == 0; }) // 5 times };
After you offload a cudaFlow capturer (possibly multiple times), it is considered executed, and the executor will not run an offloaded cudaFlow capturer after leaving the cudaFlow capturer task callable. On the contrary, the executor will run a un-offloaded cudaFlow capturer once. For example, the following two versions represent the same execution logic.
// version 1: explicitly offload a cudaFlow capturer once taskflow.emplace(tf::cudaFlowCapturer& cf) { cf.kernel(grid, block, shm, my_kernel, my_kernel_args).name("my_kernel"); cf.offload(); }; // version 2 (same as version 1): executor offloads the cudaFlow capturer once taskflow.emplace(tf::cudaFlowCapturer& cf) { cf.kernel(grid, block, shm, my_kernel, my_kernel_args).name("my_kernel"); };
Between successive offloads (i.e., executions of a cudaFlow capturer), you can update the graph using rebind methods in tf::
taskflow.emplace(tf::cudaFlowCapturer& cf) { tf::cudaTask task = cf.kernel(grid1, block1, shm1, kernel1, kernel1_args); cf.offload(); // rebind task to another kernel with different parameters cf.rebind_kernel(task, grid2, block2, shm2, kernel2, kernel2_args); cf.offload(); };
When you call a rebind method, it destroys the underlying executable graph if it exists, and a new executable graph will be recreated for the next offload. A cudaFlow capturer does not maintain an one-to-one mapping between the user-described graph and the actual executable graph. We currently support the following rebind methods:
- tf::
cudaFlowCapturer:: rebind_on: rebinds a task to a new sequence of GPU operations - tf::
cudaFlowCapturer:: rebind_memcpy: rebinds a task to a memcpy operation - tf::
cudaFlowCapturer:: rebind_memset: rebinds a task to a memset operation - tf::
cudaFlowCapturer:: rebind_copy: rebinds a task to a copy operation - tf::
cudaFlowCapturer:: rebind_kernel: rebinds a task to a kernel operation
Each of these rebind methods takes an additional argument of a previously created capture task. The rest arguments remain the same as the corresponding task creation methods.
Use cudaFlow Capturer in a Standalone Environment
You can use tf::
tf::cudaFlowCapturer cf; // create a standalone cudaFlow capturer tf::cudaTask h2d_x = cf.copy(dx, hx.data(), N).name("h2d_x"); tf::cudaTask h2d_y = cf.copy(dy, hy.data(), N).name("h2d_y"); tf::cudaTask d2h_x = cf.copy(hx.data(), dx, N).name("d2h_x"); tf::cudaTask d2h_y = cf.copy(hy.data(), dy, N).name("d2h_y"); tf::cudaTask saxpy = cf.kernel((N+255)/256, 256, 0, saxpy, N, 2.0f, dx, dy) .name("saxpy"); saxpy.succeed(h2d_x, h2d_y) // kernel runs after host-to-device copy .precede(d2h_x, d2h_y); // kernel runs before device-to-host copy cf.offload(); // offload and run the standalone cudaFlow capturer once
The following example creates a cudaFlow capturer and executes it on GPU 2.
tf::cudaScopedDevice gpu(2); tf::cudaFlowCapturer cf; // create a standalone cudaFlow capturer on GPU 2 cf.offload(); // run the capturer once on GPU 2