GPU Tasking (cudaFlowCapturer)
Contents
You can create a cudaFlow through stream capture, which allows you to capture information on GPU activities that are submitted to the stream managed by a cudaFlowCapturer.
Capture a cudaFlow
When your program has no access to direct kernel calls but invoke it through a stream-based interface (e.g., cuBLAS and cuDNN library functions), you can use tf::
The following example creates a CUDA graph that captures two kernel tasks, task_1
and task_2
, where task_1
(i.e., my_kernel_1
) runs before task_2
(i.e., my_kernel_2
).
#include <taskflow/cudaflow.hpp> int main() { tf::Executor executor; tf::Taskflow taskflow; tf::Task task = taskflow.emplace([&](tf::cudaFlowCapturer& capturer){ // capture my_kernel_1 through a stream managed by capturer tf::cudaTask task_1 = capturer.on([&](cudaStream_t stream){ my_kernel_1<<<grid_1, block_1, shm_size_1, stream>>>(my_parameters_1); }).name("my_kernel_1"); // capture my_kernel_2 through a stream managed by capturer tf::cudaTask task_2 = capturer.on([&](cudaStream_t stream){ my_kernel_2<<<grid_2, block_2, shm_size_2, stream>>>(my_parameters_2); }).name("my_kernel_2"); // my_kernel_1 runs before my_kernel_2 task_1.precede(task_2); }).name("capturer"); executor.run(taskflow).wait(); taskflow.dump(std::cout); return 0; }
The stream object passed to each tf::
A cudaFlowCapturer lives with the callable. When the executor invoke the capturer callable, it creates the cudaFlowCapturer and will destroy it until all internal operations finish.
Common Capture Methods
cudaFlowCapturer defines a set of methods for capturing common GPU operations, such as tf::
tf::Task task = taskflow.emplace([](tf::cudaFlowCapturer& capturer){ // copy data from host_data to gpu_data tf::cudaTask h2d = capturer.memcpy(gpu_data, host_data, bytes).name("h2d"); // capture my_kernel to do computation on gpu_data tf::cudaTask kernel = capturer.on([&](cudaStream_t stream){ my_kernel<<<grid, block, shm_size, stream>>>(gpu_data, arg1, arg2, ...); }).name("my_kernel"); // copy data from gpu_data to host_data tf::cudaTask d2h = capturer.memcpy(host_data, gpu_data, bytes).name("d2h"); h2d.precede(kernel); kernel.precede(d2h); }).name("capturer");
Create a Capturer on a Specific GPU
You can capture a cudaFlow on a specific GPU by calling tf::
tf::Task task = taskflow.emplace_on([](tf::cudaFlowCapturer& capturer){ // here, capturer is under GPU device 2 // ... }, 2);
The above example creates a capturer on GPU 2. When the executor runs the callable, it switches to GPU 2 and all the functions within the callable are called under this context.
Create a Capturer within a cudaFlow
Within a parent cudaFlow, you can capture a cudaFlow to form a subflow that eventually becomes a child node in the underlying CUDA task graph. The following example defines a captured flow task2
of two dependent tasks, task2_1
and task2_2
, and task2
runs after task1
.
tf::Task task = taskflow.emplace([&](tf::cudaFlow& cf){ tf::cudaTask task1 = cf.kernel(grid, block, shm, my_kernel, args...) .name("my_kernel"); // task2 forms a subflow in cf and becomes a child node in the underlying // CUDA graph tf::cudaTask task2 = cf.capture([&](tf::cudaFlowCapturer& capturer){ // capture my_kernel1 using the given stream tf::cudaTask task2_1 = capturer.on([&](cudaStream_t stream){ my_kernel2<<<grid1, block1, shm_size1, stream>>>(args1...); }).name("my_kernel1"); // capture my_kernel2 using the given stream tf::cudaTask task2_2 = capturer.on([&](cudaStream_t stream){ my_kernel2<<<grid2, block2, shm_size2, stream>>>(args2...); }).name("my_kernel2"); task2_1.precede(task2_2); }).name("capturer"); task1.precede(task2); }).name("cudaFlow");
Create a Custom Capturer
By inheriting tf::
class MyCapturer : public tf::cudaFlowCapturerBase { public: // capture a custom kernel tf::cudaTask capture_custom_kernel(args...) { return factory()->on([this, args...](cudaStream_t stream){ invoke_custom_kernel(stream, args...); }); } }; tf::Task task = taskflow.emplace([&](tf::cudaFlowCapturer& capturer){ // create a custom capturer, MyCapturer, from the parent capturer (factory) MyCapturer* mc = capturer.make_capturer<MyCapturer>(); tf::cudaTask task_1 = mc->capture_custom_kernel(args...).name("custom_kernel"); tf::cudaTask task_2 = capturer.on([&](cudaStream_t stream){ other_kernel<<<grid, block, shm_size, stream>>>(other_args...); }).name("other_kernel"); task_1.precede(task_2); }).name("capturer");
Taskflow provides several class interfaces, such as tf::
Offload a cudaFlow Capturer
By default, the executor offloads and executes the cudaFlow capturer once. When you offload a cudaFlow capturer, an executable graph will be created. The system runtime transforms the user-described graph into an executable graph optimized for maximum stream concurrency. Depending on the optimization, the user-described graph may be different from the actual executable graph. You can explicitly execute a cudaFlow using different offload methods:
taskflow.emplace([](tf::cudaFlowCapturer& cf) { // ... capture CUDA tasks cf.offload(); // offload the cudaFlow capturer and run it once cf.offload_n(10); // offload the cudaFlow capturer and run it 10 times cf.offload_until([repeat=5] () mutable { return repeat-- == 0; }) // five times });
After you offload a cudaFlow capturer, it is considered executed, and the executor will not run an offloaded cudaFlow after leaving the cudaFlow capturer task callable. On the other hand, if a cudaFlow capturer is not offloaded, the executor runs it once. For example, the following two versions represent the same execution logic.
// version 1: explicitly offload a cudaFlow capturer once taskflow.emplace([](tf::cudaFlowCapturer& cf) { cf.single_task([] __device__ (){}); cf.offload(); }); // version 2 (same as version 1): executor offloads the cudaFlow capturer once taskflow.emplace([](tf::cudaFlowCapturer& sf) { cf.single_task([] __device__ (){}); });
Update a cudaFlow Capturer
Between successive offloads (i.e., executions of a cudaFlow capturer), you can update the captured task by rebinding it to another task type. For example, you can rebind a kernel task to a memory task from an offloaded cudaFlow capturer.
taskflow.emplace(tf::cudaFlowCapturer& cf) { tf::cudaTask task = cf.kernel(grid1, block1, shm1, kernel1, kernel1_args); cf.offload(); // rebind task to another kernel with different parameters cf.rebind_kernel(task, grid2, block2, shm2, kernel2, kernel2_args); cf.offload(); // rebind task to another task type is OK in a capturer cf.rebind_memset(task, target, 0, num_bytes); cf.offload(); };
When you call a rebind method, it destroys the underlying executable graph if it exists, and a new executable graph will be recreated for the next offload. A cudaFlow capturer does not maintain an one-to-one mapping between the user-described graph and the actual executable graph, and therefore the rebind methods are more flexible than the update methods of tf::
Use cudaFlow Capturer in a Standalone Environment
You can use tf::
tf::cudaFlowCapturer cf; // create a standalone cudaFlow capturer tf::cudaTask h2d_x = cf.copy(dx, hx.data(), N).name("h2d_x"); tf::cudaTask h2d_y = cf.copy(dy, hy.data(), N).name("h2d_y"); tf::cudaTask d2h_x = cf.copy(hx.data(), dx, N).name("d2h_x"); tf::cudaTask d2h_y = cf.copy(hy.data(), dy, N).name("d2h_y"); tf::cudaTask saxpy = cf.kernel((N+255)/256, 256, 0, saxpy, N, 2.0f, dx, dy) .name("saxpy"); saxpy.succeed(h2d_x, h2d_y) // kernel runs after host-to-device copy .precede(d2h_x, d2h_y); // kernel runs before device-to-host copy cf.offload(); // offload and run the standalone cudaFlow capturer once
When using cudaFlow Capturer in a standalone environment, it is your choice to decide its GPU context. The following example creates a cudaFlow capturer and executes it on GPU 2.
tf::cudaScopedDevice gpu(2); tf::cudaFlowCapturer cf; // create a standalone cudaFlow capturer on GPU 2 cf.offload(); // run the capturer once on GPU 2