Taskflow provides template functions that construct tasks to perform parallel iterations over a range of items.
Index-based parallel-for performs parallel iterations over a range [first, last) with the given step size. The task created by tf::Taskflow::for_each_index(B&& first, E&& last, S&& step, C&& callable) represents parallel execution of the following loop:
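A sketch of that loop, for a positive and a negative step respectively:

```cpp
// positive step (first < last): visits first, first+step, first+2*step, ...
for(auto i = first; i < last; i += step) {
  callable(i);
}

// negative step (first > last): visits first, first+step, first+2*step, ...
for(auto i = first; i > last; i += step) {
  callable(i);
}
```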
Only integer-based ranges are supported, and a range can go in either a positive or negative direction. Notice that the direction is defined in terms of the range [first, last), where last is excluded. In the positive case, the 50 items are 0, 2, 4, 6, 8, ..., 96, 98. In the negative case, the 50 items are 100, 98, 96, 94, ..., 4, 2. An example of the Taskflow graph for the positive case under 12 workers is depicted below:
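In code, the two ranges above can be created as follows (a sketch assuming a tf::Taskflow object named taskflow):

```cpp
// positive direction: 50 items, 0, 2, 4, ..., 96, 98
taskflow.for_each_index(0, 100, 2, [](int i) {
  std::cout << "iterating on index " << i << '\n';
});

// negative direction: 50 items, 100, 98, 96, ..., 4, 2
taskflow.for_each_index(100, 0, -2, [](int i) {
  std::cout << "iterating on index " << i << '\n';
});
```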
The index types, B, E, and S, are templates to preserve the variable types, and their underlying types must be of the same integral type (e.g., int, size_t, unsigned). By default, tf::Taskflow::for_each_index creates a task that spawns a subflow (see C3: Dynamic Tasking) to run iterations in parallel. The subflow closure captures all input arguments through perfect forwarding to form a stateful closure, such that any changes to the arguments will be visible to the execution context of the subflow. For example:
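A minimal sketch of this pattern, assuming a tf::Taskflow object named taskflow; std::ref passes first and last by reference, so the assignments made by init are visible when pf executes:

```cpp
int* vec;
int first, last;

auto init = taskflow.emplace([&](){
  first = 0;
  last  = 1000;
  vec   = new int[1000];
});

// std::ref forms a stateful closure: pf reads first/last at execution time
auto pf = taskflow.for_each_index(std::ref(first), std::ref(last), 1,
  [&](int i) {
    std::cout << "parallel iteration on item " << vec[i] << '\n';
  }
);

init.precede(pf);  // pf runs only after init finishes
```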
When init finishes, the parallel-for task pf will see first as 0 and last as 1000, and it performs parallel iterations over the 1000 items. This property is especially important for task graph parallelism, because users can define end-to-end parallelism through stateful closures that marshal parameter exchange between dependent tasks.
Iterator-based parallel-for performs parallel iterations over a range specified by two STL-styled iterators, first and last. The task created by tf::Taskflow::for_each(B&& first, E&& last, C&& callable) represents a parallel execution of the following loop:
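In sketch form:

```cpp
for(auto itr = first; itr != last; itr++) {
  callable(*itr);
}
```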
By default, tf::Taskflow::for_each(B&& first, E&& last, C&& callable) creates a task that spawns a subflow (see C3: Dynamic Tasking) to apply the callable to the object obtained by dereferencing every iterator in the range [first, last). It is the user's responsibility to ensure that the range remains valid throughout the execution of the parallel-for task. Iterators must have the post-increment operator ++ defined. This version of parallel-for applies to all iterable STL containers.
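For instance, a minimal sketch that doubles every element of a std::vector (assuming a tf::Taskflow object named taskflow):

```cpp
std::vector<int> items{1, 2, 3, 4, 5, 6, 7, 8};

// items must remain valid until the parallel-for task finishes
taskflow.for_each(items.begin(), items.end(), [](int& item) {
  item *= 2;
});
```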
Similar to index-based parallel-for, the iterator types are templates that let users leverage the property of stateful closures. For example:
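A sketch of this pattern, mirroring the index-based example above:

```cpp
std::vector<int> vec;
std::vector<int>::iterator first, last;

auto init = taskflow.emplace([&](){
  vec.resize(1000);
  first = vec.begin();
  last  = vec.end();
});

// std::ref defers reading first and last until pf actually runs
auto pf = taskflow.for_each(std::ref(first), std::ref(last), [](int item) {
  std::cout << "parallel iteration on item " << item << '\n';
});

init.precede(pf);
```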
When init finishes, the parallel-for task pf will see first pointing to the beginning of vec and last pointing to its end, and it performs parallel iterations over the 1000 items. The two tasks form an end-to-end task graph where the parameters of the parallel-for are computed on the fly.
At runtime, the parallel-for task automatically partitions the range into chunks and assigns each chunk a task in the spawned subflow. Inspired by the scheduling algorithms of OpenMP, we support three partition algorithms: static partition, dynamic partition, and guided partition. Each of these methods takes an additional unsigned argument for the chunk size.
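As a sketch, explicit partition calls could look like the following; the *_static, *_dynamic, and *_guided method names are assumptions based on this naming pattern and should be verified against the headers of your Taskflow release:

```cpp
auto callable = [](int i) { std::cout << i << '\n'; };

// hypothetical explicit-partition variants (verify exact names in your release)
taskflow.for_each_index_static (0, 100, 2, callable, 4);  // static partition, chunk size 4
taskflow.for_each_index_dynamic(0, 100, 2, callable, 4);  // dynamic partition, chunk size 4
taskflow.for_each_index_guided (0, 100, 2, callable, 4);  // guided partition, chunk size 4
```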
The picture below illustrates the three partition algorithms.
By default, tf::Taskflow::for_each_index and tf::Taskflow::for_each adopt the guided partition algorithm with a chunk size of one, which in practice produces decent performance in many applications. However, depending on the workload, users may explicitly call the static, dynamic, or guided partition algorithm with a specified chunk size.