Cookbook » GPU Tasking (cudaFlowCapturer)

You can create a cudaFlow through stream capture, which allows you to capture information on GPU activities that are submitted to the stream managed by a cudaFlowCapturer.

Capture a cudaFlow

When your program has no access to direct kernel calls but invoke it through a stream-based interface (e.g., cuBLAS and cuDNN library functions), you can use tf::cudaFlowCapturer to capture the GPU activities into a cudaFlow. A cudaFlowCapturer is similar to a cudaFlow except it forms a GPU task graph through stream capture. You use the method tf::cudaFlowCapturer::on to capture a sequence of asynchronous CUDA operations through the given stream.

The following example creates a CUDA graph that captures two kernel tasks, task_1 and task_2, where task_1 (i.e., my_kernel_1) runs before task_2 (i.e., my_kernel_2).

#include <taskflow/cudaflow.hpp>

int main() {

  tf::Executor executor;
  tf::Taskflow taskflow;

  tf::Task task = taskflow.emplace([&](tf::cudaFlowCapturer& capturer){
    // capture my_kernel_1 through a stream managed by capturer
    tf::cudaTask task_1 = capturer.on([&](cudaStream_t stream){ 
      my_kernel_1<<<grid_1, block_1, shm_size_1, stream>>>(my_parameters_1);
    }).name("my_kernel_1");
  
    // capture my_kernel_2 through a stream managed by capturer
    tf::cudaTask task_2 = capturer.on([&](cudaStream_t stream){ 
      my_kernel_2<<<grid_2, block_2, shm_size_2, stream>>>(my_parameters_2);
    }).name("my_kernel_2");
  
    // my_kernel_1 runs before my_kernel_2
    task_1.precede(task_2);
  }).name("capturer");

  executor.run(taskflow).wait();

  taskflow.dump(std::cout);

  return 0;
}
cudaFlowCapturer cluster_capturer cudaFlow: capturer custom_kernel custom_kernel other_kernel other_kernel custom_kernel->other_kernel

The stream object passed to each tf::cudaFlowCapturer::on call may differ, and it depends on how the internal optimization algorithm maximizes the GPU parallelism.

A cudaFlowCapturer lives with the callable. When the executor invoke the capturer callable, it creates the cudaFlowCapturer and will destroy it until all internal operations finish.

Common Capture Methods

cudaFlowCapturer defines a set of methods for capturing common GPU operations, such as tf::cudaFlowCapturer::kernel, tf::cudaFlowCapturer::memcpy, tf::cudaFlowCapturer::memset, and so on. For example, the following code snippet uses these pre-defined methods to construct a GPU task graph of one host-to-device copy, kernel, and one device-to-host copy, in this order of their dependencies.

tf::Task task = taskflow.emplace([](tf::cudaFlowCapturer& capturer){
  // copy data from host_data to gpu_data
  tf::cudaTask h2d = capturer.memcpy(gpu_data, host_data, bytes).name("h2d");

  // capture my_kernel to do computation on gpu_data
  tf::cudaTask kernel = capturer.on([&](cudaStream_t stream){  
    my_kernel<<<grid, block, shm_size, stream>>>(gpu_data, arg1, arg2, ...);
  }).name("my_kernel");

  // copy data from gpu_data to host_data
  tf::cudaTask d2h = capturer.memcpy(host_data, gpu_data, bytes).name("d2h");
  
  h2d.precede(kernel);
  kernel.precede(d2h);
}).name("capturer");
cudaFlowCapturer cluster_capturer cudaFlow: capturer h2d h2d my_kernel my_kernel h2d->my_kernel dh2 dh2 my_kernel->dh2

Create a Capturer on a Specific GPU

You can capture a cudaFlow on a specific GPU by calling tf::Taskflow::emplace_on. By default, a cudaFlow runs on the current GPU associated with the caller, which is typically 0. Similar to Run a cudaFlow on Multiple GPUs, you can emplace a cudaFlowCapturer on a specific GPU.

tf::Task task = taskflow.emplace_on([](tf::cudaFlowCapturer& capturer){
  // here, capturer is under GPU device context 2
  // ...
}, 2);

The above example creates a capturer on GPU 2. When the executor runs the callable, it switches to GPU 2 and all the functions within the callable are called under this context.

Create a Capturer within a cudaFlow

Within a parent cudaFlow, you can capture a cudaFlow to form a subflow that eventually becomes a child node in the underlying CUDA task graph. The following example defines a captured flow task2 of two dependent tasks, task2_1 and task2_2, and task2 runs after task1.

tf::Task task = taskflow.emplace([&](tf::cudaFlow& cf){

  tf::cudaTask task1 = cf.kernel(grid, block, shm, my_kernel, args...)
                         .name("my_kernel");
  
  // task2 forms a subflow in cf and becomes a child node in the underlying 
  // CUDA graph
  tf::cudaTask task2 = cf.capture([&](tf::cudaFlowCapturer& capturer){
    
    // capture my_kernel1 using the given stream
    tf::cudaTask task2_1 = capturer.on([&](cudaStream_t stream){  
      my_kernel2<<<grid1, block1, shm_size1, stream>>>(args1...);
    }).name("my_kernel1");  
    
    // capture my_kernel2 using the given stream
    tf::cudaTask task2_2 = capturer.on([&](cudaStream_t stream){  
      my_kernel2<<<grid2, block2, shm_size2, stream>>>(args2...);
    }).name("my_kernel2");   

    task2_1.precede(task2_2);
  }).name("capturer");

  task1.precede(task2);
}).name("cudaFlow");
Taskflow cluster_p0x14117f0 cudaFlow: cudaFlow cluster_p0x7fb730001510 cudaSubflow: capturer p0x14117f0 cudaFlow p0x7fb730000e60 my_kernel p0x7fb730001510 capturer p0x7fb730000e60->p0x7fb730001510 p0x7fb730001510->p0x14117f0 p0x7fb7300015b0 my_kernel1 p0x7fb730001650 my_kernel2 p0x7fb7300015b0->p0x7fb730001650 p0x7fb730001650->p0x7fb730001510

Create a Custom Capturer

By inheriting tf::cudaFlowCapturerBase, you can create your own capturer. The tf::cudaFlowCapturer has a factory interface, tf::cudaFlowCapturer::make_capturer, for users to create custom capturers with lifetimes managed by a parent capturer. The parent capturer is accessible through tf::cudaFlowCapturerBase::factory. This is convenient when you need certain objects alive during the capturing. The following example defines a custom capturer that takes the given stream from tf::cudaFlowCapturer::on to capture a custom kernel.

class MyCapturer : public tf::cudaFlowCapturerBase {

  public:

  // capture a custom kernel
  tf::cudaTask capture_custom_kernel(args...) {
    return factory()->on([this, args...](cudaStream_t stream){
      invoke_custom_kernel(stream, args...);
    });
  }
};

tf::Task task = taskflow.emplace([&](tf::cudaFlowCapturer& capturer){
  // create a custom capturer, MyCapturer, from the parent capturer (factory)
  MyCapturer* mc = capturer.make_capturer<MyCapturer>();

  tf::cudaTask task_1 = mc->capture_custom_kernel(args...).name("custom_kernel");
  tf::cudaTask task_2 = capturer.on([&](cudaStream_t stream){
    other_kernel<<<grid, block, shm_size, stream>>>(other_args...);
  }).name("other_kernel");

  task_1.precede(task_2);
}).name("capturer");
cudaFlowCapturer cluster_capturer cudaFlow: capturer custom_kernel custom_kernel other_kernel other_kernel custom_kernel->other_kernel

Taskflow provides several class interfaces, such as tf::cublasFlowCapturer, on top of tf::cudaFlowCapturerBase for users to use external high-performance CUDA libraries together with Taskflow.

Offload and Update a cudaFlow Capturer

Similar to Offload and Update a cudaFlow, you can offload a cudaFlow capturer explicitly and then update the graph for iterative executions. When you offload a cudaFlow capturer, an executable graph will be created. The system runtime transforms the user-described graph into an executable graph optimized for maximum stream concurrency. Depending on the optimization, the user-described graph may be different from the actual executable graph. The following example captures a kernel task and runs it twice using tf::cudaFlowCapturer::offload.

1: taskflow.emplace([&] (tf::cudaFlowCapturer& cf) {
2:   tf::cudaTask task = cf.kernel(grid1, block1, shm1, my_kernel, args1...);
3:   cf.offload();
4:   cf.offload();
5: });

We currently supports the following offload methods:

taskflow.emplace(tf::cudaFlowCapturer& cf) {
  // ... create capture tasks
  cf.offload();      // offload the cudaFlow capturer and run it once
  cf.offload_n(10);  // offload the cudaFlow capturer and run it 10 times
  cf.offload_until([repeat=5] () mutable { return repeat-- == 0; })  // 5 times
};

After you offload a cudaFlow capturer (possibly multiple times), it is considered executed, and the executor will not run an offloaded cudaFlow capturer after leaving the cudaFlow capturer task callable. On the contrary, the executor will run a un-offloaded cudaFlow capturer once. For example, the following two versions represent the same execution logic.

// version 1: explicitly offload a cudaFlow capturer once
taskflow.emplace(tf::cudaFlowCapturer& cf) {
  cf.kernel(grid, block, shm, my_kernel, my_kernel_args).name("my_kernel");
  cf.offload();
};

// version 2 (same as version 1): executor offloads the cudaFlow capturer once
taskflow.emplace(tf::cudaFlowCapturer& cf) {
  cf.kernel(grid, block, shm, my_kernel, my_kernel_args).name("my_kernel");
};

Between successive offloads (i.e., executions of a cudaFlow capturer), you can update the graph using rebind methods in tf::cudaFlowCapturer. Unlike the update methods in tf::cudaFlow (see Offload and Update a cudaFlow) that restrict you from changing the topology, you can alter the graph topology and rebind tasks to different ones in a cudaFlow capturer after an offload.

taskflow.emplace(tf::cudaFlowCapturer& cf) {
  tf::cudaTask task = cf.kernel(grid1, block1, shm1, kernel1, kernel1_args);
  cf.offload();

  // rebind task to another kernel with different parameters
  cf.rebind_kernel(task, grid2, block2, shm2, kernel2, kernel2_args);
  cf.offload();
};

When you call a rebind method, it destroys the underlying executable graph if it exists, and a new executable graph will be recreated for the next offload. A cudaFlow capturer does not maintain an one-to-one mapping between the user-described graph and the actual executable graph. We currently support the following rebind methods:

Each of these rebind methods takes an additional argument of a previously created capture task. The rest arguments remain the same as the corresponding task creation methods.

Use cudaFlow Capturer in a Standalone Environment

You can use tf::cudaFlowCapturer in a standalone environment without going through tf::Taskflow and offloads it to GPU from the caller thread. All the features we have discussed so far are applicable for the standalone use.

tf::cudaFlowCapturer cf;  // create a standalone cudaFlow capturer

tf::cudaTask h2d_x = cf.copy(dx, hx.data(), N).name("h2d_x");
tf::cudaTask h2d_y = cf.copy(dy, hy.data(), N).name("h2d_y");
tf::cudaTask d2h_x = cf.copy(hx.data(), dx, N).name("d2h_x");
tf::cudaTask d2h_y = cf.copy(hy.data(), dy, N).name("d2h_y");
tf::cudaTask saxpy = cf.kernel((N+255)/256, 256, 0, saxpy, N, 2.0f, dx, dy)
                       .name("saxpy");

saxpy.succeed(h2d_x, h2d_y)   // kernel runs after  host-to-device copy
     .precede(d2h_x, d2h_y);  // kernel runs before device-to-host copy

cf.offload();  // offload and run the standalone cudaFlow capturer once

The following example creates a cudaFlow capturer and executes it on GPU 2.

tf::cudaScopedDevice gpu(2);
tf::cudaFlowCapturer cf;  // create a standalone cudaFlow capturer on GPU 2
cf.offload();             // run the capturer once on GPU 2