Cookbook » GPU Tasking (cudaFlowCapturer)

You can create a cudaFlow through stream capture, which allows you to capture information on GPU activities that are submitted to the stream managed by a cudaFlowCapturer.

Capture a cudaFlow

When your program has no access to direct kernel calls but invoke it through a stream-based interface (e.g., cuBLAS and cuDNN library functions), you can use tf::cudaFlowCapturer to capture the GPU activities into a cudaFlow. A cudaFlowCapturer is similar to a cudaFlow except it forms a GPU task graph through stream capture. You use the method tf::cudaFlowCapturer::on to capture a sequence of asynchronous CUDA operations through the given stream.

The following example creates a CUDA graph that captures two kernel tasks, task_1 and task_2, where task_1 (i.e., my_kernel_1) runs before task_2 (i.e., my_kernel_2).

#include <taskflow/cudaflow.hpp>

int main() {

  tf::Executor executor;
  tf::Taskflow taskflow;

  tf::Task task = taskflow.emplace([&](tf::cudaFlowCapturer& capturer){
    // capture my_kernel_1 through a stream managed by capturer
    tf::cudaTask task_1 = capturer.on([&](cudaStream_t stream){ 
      my_kernel_1<<<grid_1, block_1, shm_size_1, stream>>>(my_parameters_1);
    }).name("my_kernel_1");
  
    // capture my_kernel_2 through a stream managed by capturer
    tf::cudaTask task_2 = capturer.on([&](cudaStream_t stream){ 
      my_kernel_2<<<grid_2, block_2, shm_size_2, stream>>>(my_parameters_2);
    }).name("my_kernel_2");
  
    // my_kernel_1 runs before my_kernel_2
    task_1.precede(task_2);
  }).name("capturer");

  executor.run(taskflow).wait();

  taskflow.dump(std::cout);

  return 0;
}
cudaFlowCapturer cluster_capturer cudaFlow: capturer custom_kernel custom_kernel other_kernel other_kernel custom_kernel->other_kernel

The stream object passed to each tf::cudaFlowCapturer::on call may differ, and it depends on how the internal optimization algorithm maximizes the GPU parallelism.

A cudaFlowCapturer lives with the callable. When the executor invoke the capturer callable, it creates the cudaFlowCapturer and will destroy it until all internal operations finish.

Common Capture Methods

cudaFlowCapturer defines a set of methods for capturing common GPU operations, such as tf::cudaFlowCapturer::kernel, tf::cudaFlowCapturer::memcpy, tf::cudaFlowCapturer::memset, and so on. For example, the following code snippet uses these pre-defined methods to construct a GPU task graph of one host-to-device copy, kernel, and one device-to-host copy, in this order of their dependencies.

tf::Task task = taskflow.emplace([](tf::cudaFlowCapturer& capturer){
  // copy data from host_data to gpu_data
  tf::cudaTask h2d = capturer.memcpy(gpu_data, host_data, bytes).name("h2d");

  // capture my_kernel to do computation on gpu_data
  tf::cudaTask kernel = capturer.on([&](cudaStream_t stream){  
    my_kernel<<<grid, block, shm_size, stream>>>(gpu_data, arg1, arg2, ...);
  }).name("my_kernel");

  // copy data from gpu_data to host_data
  tf::cudaTask d2h = capturer.memcpy(host_data, gpu_data, bytes).name("d2h");
  
  h2d.precede(kernel);
  kernel.precede(d2h);
}).name("capturer");
cudaFlowCapturer cluster_capturer cudaFlow: capturer h2d h2d my_kernel my_kernel h2d->my_kernel dh2 dh2 my_kernel->dh2

Create a Capturer on a Specific GPU

You can capture a cudaFlow on a specific GPU by calling tf::Taskflow::emplace_on. By default, a cudaFlow runs on the current GPU associated with the caller, which is typically 0. You can emplace a cudaFlowCapturer on a specific GPU.

tf::Task task = taskflow.emplace_on([](tf::cudaFlowCapturer& capturer){
  // here, capturer is under GPU device 2
  // ...
}, 2);

The above example creates a capturer on GPU 2. When the executor runs the callable, it switches to GPU 2 and all the functions within the callable are called under this context.

Create a Capturer within a cudaFlow

Within a parent cudaFlow, you can capture a cudaFlow to form a subflow that eventually becomes a child node in the underlying CUDA task graph. The following example defines a captured flow task2 of two dependent tasks, task2_1 and task2_2, and task2 runs after task1.

tf::Task task = taskflow.emplace([&](tf::cudaFlow& cf){

  tf::cudaTask task1 = cf.kernel(grid, block, shm, my_kernel, args...)
                         .name("my_kernel");
  
  // task2 forms a subflow in cf and becomes a child node in the underlying 
  // CUDA graph
  tf::cudaTask task2 = cf.capture([&](tf::cudaFlowCapturer& capturer){
    
    // capture my_kernel1 using the given stream
    tf::cudaTask task2_1 = capturer.on([&](cudaStream_t stream){  
      my_kernel2<<<grid1, block1, shm_size1, stream>>>(args1...);
    }).name("my_kernel1");  
    
    // capture my_kernel2 using the given stream
    tf::cudaTask task2_2 = capturer.on([&](cudaStream_t stream){  
      my_kernel2<<<grid2, block2, shm_size2, stream>>>(args2...);
    }).name("my_kernel2");   

    task2_1.precede(task2_2);
  }).name("capturer");

  task1.precede(task2);
}).name("cudaFlow");
Taskflow cluster_p0x14117f0 cudaFlow: cudaFlow cluster_p0x7fb730001510 cudaSubflow: capturer p0x14117f0 cudaFlow p0x7fb730000e60 my_kernel p0x7fb730001510 capturer p0x7fb730000e60->p0x7fb730001510 p0x7fb730001510->p0x14117f0 p0x7fb7300015b0 my_kernel1 p0x7fb730001650 my_kernel2 p0x7fb7300015b0->p0x7fb730001650 p0x7fb730001650->p0x7fb730001510

Create a Custom Capturer

By inheriting tf::cudaFlowCapturerBase, you can create your own capturer. The tf::cudaFlowCapturer has a factory interface, tf::cudaFlowCapturer::make_capturer, for users to create custom capturers with lifetimes managed by a parent capturer. The parent capturer is accessible through tf::cudaFlowCapturerBase::factory. This is convenient when you need certain objects alive during the capturing. The following example defines a custom capturer that takes the given stream from tf::cudaFlowCapturer::on to capture a custom kernel.

class MyCapturer : public tf::cudaFlowCapturerBase {

  public:

  // capture a custom kernel
  tf::cudaTask capture_custom_kernel(args...) {
    return factory()->on([this, args...](cudaStream_t stream){
      invoke_custom_kernel(stream, args...);
    });
  }
};

tf::Task task = taskflow.emplace([&](tf::cudaFlowCapturer& capturer){
  // create a custom capturer, MyCapturer, from the parent capturer (factory)
  MyCapturer* mc = capturer.make_capturer<MyCapturer>();

  tf::cudaTask task_1 = mc->capture_custom_kernel(args...).name("custom_kernel");
  tf::cudaTask task_2 = capturer.on([&](cudaStream_t stream){
    other_kernel<<<grid, block, shm_size, stream>>>(other_args...);
  }).name("other_kernel");

  task_1.precede(task_2);
}).name("capturer");
cudaFlowCapturer cluster_capturer cudaFlow: capturer custom_kernel custom_kernel other_kernel other_kernel custom_kernel->other_kernel

Taskflow provides several class interfaces, such as tf::cublasFlowCapturer, on top of tf::cudaFlowCapturerBase for users to use external high-performance CUDA libraries together with Taskflow.

Offload a cudaFlow Capturer

By default, the executor offloads and executes the cudaFlow capturer once. When you offload a cudaFlow capturer, an executable graph will be created. The system runtime transforms the user-described graph into an executable graph optimized for maximum stream concurrency. Depending on the optimization, the user-described graph may be different from the actual executable graph. You can explicitly execute a cudaFlow using different offload methods:

taskflow.emplace([](tf::cudaFlowCapturer& cf) {
  // ... capture CUDA tasks
  cf.offload();      // offload the cudaFlow capturer and run it once
  cf.offload_n(10);  // offload the cudaFlow capturer and run it 10 times
  cf.offload_until([repeat=5] () mutable { return repeat-- == 0; })  // five times
});

After you offload a cudaFlow capturer, it is considered executed, and the executor will not run an offloaded cudaFlow after leaving the cudaFlow capturer task callable. On the other hand, if a cudaFlow capturer is not offloaded, the executor runs it once. For example, the following two versions represent the same execution logic.

// version 1: explicitly offload a cudaFlow capturer once
taskflow.emplace([](tf::cudaFlowCapturer& cf) {
  cf.single_task([] __device__ (){});
  cf.offload();
});

// version 2 (same as version 1): executor offloads the cudaFlow capturer once
taskflow.emplace([](tf::cudaFlowCapturer& sf) {
  cf.single_task([] __device__ (){});
});

Update a cudaFlow Capturer

Between successive offloads (i.e., executions of a cudaFlow capturer), you can update the captured task by rebinding it to another task type. For example, you can rebind a kernel task to a memory task from an offloaded cudaFlow capturer.

taskflow.emplace(tf::cudaFlowCapturer& cf) {
  tf::cudaTask task = cf.kernel(grid1, block1, shm1, kernel1, kernel1_args);
  cf.offload();

  // rebind task to another kernel with different parameters
  cf.rebind_kernel(task, grid2, block2, shm2, kernel2, kernel2_args);
  cf.offload();

  // rebind task to another task type is OK in a capturer
  cf.rebind_memset(task, target, 0, num_bytes);
  cf.offload();
};

When you call a rebind method, it destroys the underlying executable graph if it exists, and a new executable graph will be recreated for the next offload. A cudaFlow capturer does not maintain an one-to-one mapping between the user-described graph and the actual executable graph, and therefore the rebind methods are more flexible than the update methods of tf::cudaFlow. Each method of task creation in tf::cudaFlowCapturer has a corresponding method of rebinding a task to that task type (e.g., tf::cudaFlowCapturer::on and tf::cudaFlowCapturer::rebind_on, tf::cudaFlowCapturer::kernel and tf::cudaFlowCapturer::kernel).

Use cudaFlow Capturer in a Standalone Environment

You can use tf::cudaFlowCapturer in a standalone environment without going through tf::Taskflow and offloads it to a GPU from the caller thread. All the features we have discussed so far apply to the standalone use.

tf::cudaFlowCapturer cf;  // create a standalone cudaFlow capturer

tf::cudaTask h2d_x = cf.copy(dx, hx.data(), N).name("h2d_x");
tf::cudaTask h2d_y = cf.copy(dy, hy.data(), N).name("h2d_y");
tf::cudaTask d2h_x = cf.copy(hx.data(), dx, N).name("d2h_x");
tf::cudaTask d2h_y = cf.copy(hy.data(), dy, N).name("d2h_y");
tf::cudaTask saxpy = cf.kernel((N+255)/256, 256, 0, saxpy, N, 2.0f, dx, dy)
                       .name("saxpy");

saxpy.succeed(h2d_x, h2d_y)   // kernel runs after  host-to-device copy
     .precede(d2h_x, d2h_y);  // kernel runs before device-to-host copy

cf.offload();  // offload and run the standalone cudaFlow capturer once

When using cudaFlow Capturer in a standalone environment, it is your choice to decide its GPU context. The following example creates a cudaFlow capturer and executes it on GPU 2.

tf::cudaScopedDevice gpu(2);
tf::cudaFlowCapturer cf;  // create a standalone cudaFlow capturer on GPU 2
cf.offload();             // run the capturer once on GPU 2