Parallel Reduction
Contents
tf::
Reduce Items with an Initial Value
The reduction task created by tf::[first, last)
using the binary operator bop
and stores the reduced result in result
. It represents the parallel execution of the following reduction loop on a GPU:
while (first != last) { *result = op(*result, *first++); }
The variable result
participates in the reduction loop and must be initialized with an initial value. The following code performs a parallel reduction to sum all the numbers in the given range with an initial value 1000
:
const size_t N = 1000000; tf::Taskflow taskflow; tf::Executor executor; std::vector<int> cpu(N, -1); int sol; // solution on CPU int* res = tf::cuda_malloc<int>(1); // solution on GPU int* gpu = tf::cuda_malloc<int>(N); // data on GPU taskflow.emplace([&](tf::cudaFlow& cf) { tf::cudaTask d2h = cf.copy(&sol, res, 1); tf::cudaTask h2d = cf.copy(gpu, cpu.data(), N); tf::cudaTask set = cf.single_task([res] __device__ () mutable { *res = 1000; // assign an initial value to res }); tf::cudaTask kernel = cf.reduce( gpu, gpu+N, res, [] __device__ (int a, int b) { return a + b; } ); kernel.succeed(h2d, set) .precede(d2h); }); executor.run(taskflow).wait(); assert(sol == N + 1000);
Reduce Items without an Initial Value
You can use tf::
*result = *first++; // no initial values participate in the reduction loop while (first != last) { *result = op(*result, *first++); }
The variable result
is overwritten with the reduced value and no initial values participate in the reduction loop. The following code performs a parallel reduction to sum all the numbers in the given range without any initial value:
const size_t N = 1000000; tf::Taskflow taskflow; tf::Executor executor; std::vector<int> cpu(N, -1); int sol; // solution on CPU int* res = tf::cuda_malloc<int>(1); // solution on GPU int* gpu = tf::cuda_malloc<int>(N); // data on GPU taskflow.emplace([&](tf::cudaFlow& cf) { tf::cudaTask d2h = cf.copy(&sol, res, 1); tf::cudaTask h2d = cf.copy(gpu, cpu.data(), N); tf::cudaTask kernel = cf.uninitialized_reduce( gpu, gpu+N, res, [] __device__ (int a, int b) { return a + b; } ); kernel.succeed(h2d) .precede(d2h); }); executor.run(taskflow).wait(); assert(sol == N);
Miscellaneous Items
Parallel-reduction algorithms are also available in tf::