cudaFlow Algorithms » Parallel Transforms

tf::cudaFlow provides a template method, tf::cudaFlow::transform, for creating a task to parallelly apply the given function to a range and stores the result in another range.

Iterator-based Parallel Transforms

Iterator-based parallel-transform applies the given transform function to a range of items and store the result in another range specified by two iterators, first and last. The two iterators are typically two raw pointers to the first element and the next to the last element in the range in GPU memory space. The task created by tf::cudaFlow::transform(I first, I last, C&& callable, S... srcs) represents a kernel of parallel execution for the following loop:

while (first != last) {
  *first++ = callable(*src1++, *src2++, *src3++, ...);
}

The two iterators, first and last, are typically two raw pointers to the first element and the next to the last element in the range. The following example creates a transform kernel that assigns each element, starting from gpu_data to gpu_data + 1000, to the sum of the corresponding elements at gpu_data_x, gpu_data_y, and gpu_data_z.

taskflow.emplace([](tf::cudaFlow& cf){
  // gpu_data[i] = gpu_data_x[i] + gpu_data_y[i] + gpu_data_z[i]
  tf::cudaTask task = cf.transform(
    gpu_data, gpu_data + 1000, 
    [] __device__ (int xi, int yi, int zi) { return xi + yi + zi; },
    gpu_data_x, gpu_data_y, gpu_data_z
  ); 
});

Each iteration is independent of each other and is assigned one kernel thread to run the callable. Since the callable runs on GPU, it must be declared with a __device__ specifier.

Miscellaneous Items

The parallel-transform algorithm is also available in tf::cudaFlowCapturer::transform.