
Running a for-loop in parallel is the most fundamental building block in parallel programming.
In this chapter, we are going to demonstrate how to use Cpp-Taskflow
to create a task dependency graph of parallel for-loop.

@section RangeBasedForLoop Range-based For-loop

Cpp-Taskflow has a STL-style method 
tf::Taskflow::parallel_for(I beg, I end, C&& callable, size_t partitions)
that partitions a range of items and applies a callable to each item in the partition in parallel.
The method constructs a task dependency graph representing this workload
and returns a task pair as two synchronization points to this task graph.

@code{.cpp}
 1: tf::Taskflow taskflow;
 2:
 3: std::vector<int> items {1, 2, 3, 4, 5, 6, 7, 8};
 4:
 5: auto [S, T] = taskflow.parallel_for(items.begin(), items.end(), [] (int item) {
 6:   std::cout << std::this_thread::get_id() << " runs " << item << std::endl;
 7: });
 8:
 9: S.work([](){ std::cout << "S\n"; }).name("S");
10: T.work([](){ std::cout << "T\n"; }).name("T");
11:
12: taskflow.dump(std::cout);
@endcode

The above code generates the following task dependency graph. 
The label 0x56\* represents an internal task node to execute the callable object.
By default (@c partitions=0), Cpp-Taskflow evenly partitions and distributes the workload 
across the maximum hardware concurrency.
Suppose our example has four logical cores,
each internal node corresponds to a partition taking two items (eight tasks in total).

@image html images/parallel_for1.png width=60%

Debrief:
@li Line 1 creates a taskflow object of four worker threads
@li Line 3 creates a vector container of eight items
@li Line 5-7 creates a parallel execution graph using the method @c parallel_for
@li Line 9-10 names the synchronization tasks @c S and @c T
@li Line 12 dumps the graph to a dot format which can be visualized through @GraphVizOnline

Here is one possible output of this program:

@code{.sh}
S
139931471636224 runs 1
139931471636224 runs 2
139931480028928 runs 7
139931480028928 runs 8
139931496814336 runs 3
139931496814336 runs 4
139931488421632 runs 5
139931488421632 runs 6
T
@endcode

@section PartitionTheWorkloadExplicitly Partition the Workload Explicitly

By default, Cpp-Taskflow partitions the workload evenly across the workers.
In some cases, it is useful to disable this feature and apply user-specified partition.
The method @c parallel_for tasks an unsigned integer @c partitions as 
the number of partitions over the items.

@code{.cpp}
auto [S, T] = taskflow.parallel_for(items.begin(), items.end(), [] (int item) {
  std::cout << std::this_thread::get_id() << " runs " << item << std::endl;
}, 8);
@endcode

The above example will force each of the eight partitions to run exactly one item.
This can be useful when you have unbalanced workload per item
and would like to enable more efficient parallelization.

@image html images/parallel_for2.png width=100%

Again, you can leave the partition variable to 0 to use our default partition strategy.



@section ConstructTheGraphExplicitly Construct the Graph Explicitly

You can explicitly construct a dependency graph that represents a parallel execution 
of a for-loop using only the basic methods tf::Taskflow::emplace and tf::Task::precede.


@code{.cpp}
tf::Task S = taskflow.emplace([](){}).name("S");
tf::Task T = taskflow.emplace([](){}).name("T");

for(auto item : items) {
  tf::Task task = taskflow.emplace([item] () {
    std::cout << std::this_thread::get_id() << " runs " << item << std::endl;
  });
  S.precede(task);
  task.precede(T);
}
@endcode

@section IndexBasedForLoop Index-based For-loop

Cpp-Taskflow provides an overload 
tf::Taskflow::parallel_for(I beg, I end, I step, C&& callable, size_t partitions) 
to parallelize an index-based for-loop. 
It takes three numbers @c beg, @c end, and @c step of the same type @c I
and applies @c callable to each index in the range `[beg, end)` with 
the step size @c step.

@code{.cpp}
1: taskflow.parallel_for(0, 10, 2, [] (int i) {
2:   std::cout << "parallel on " << i << std::endl;
3: });
4: // print 0, 2, 4, 6, 8
@endcode


It also works on the opposite order with negative step size.

@code{.cpp}
1: taskflow.parallel_for(10, 0, -2, [] (int i) {
2:   std::cout << "parallel on " << i << std::endl;
3: });
4: // print 10, 8, 6, 4, 2
@endcode

Similarly, you can explicitly specify the partition size:

@code{.cpp}
1: taskflow.parallel_for(0, 10, 2, [] (int i) {
2:   std::cout << "parallel on " << i << std::endl;
3: }, 3);
4: // print 0, 2, 4, 6, 8 (three partitions {0, 2}, {4, 6}, {8})
@endcode

By default, Cpp-Taskflow performs even partition across the number of available threads 
if no partition size is given.


@section Chapter3Example1 Example 1: Parallel Map

This example demonstrates how to use tf::Taskflow::parallel_for
to create a parallel map pattern.
The map operator modifies each item in the container to one if it is an odd number,
or zero if it is an even number.

@code{.cpp}
#include <taskflow/taskflow.hpp>

int main() {
  
  tf::Executor executor;
  tf::Taskflow taskflow;

  std::vector<int> items{1, 2, 3, 4, 5, 6, 7, 8};

  taskflow.parallel_for(items.begin(), items.end(), [] (int& item) {
    item = (item & 1) ? 1 : 0;
  });

  executor.run(taskflow).get();

  for(auto item : items) {
    std::cout << item << " ";
  }

  return 0;
}
@endcode

The program outputs the following:

@code{.sh}
1 0 1 0 1 0 1 0 
@endcode


@section Chapter3Example2 Example 2: Pipeline a Parallel For-loop

This example demonstrates how to pipeline a parallel-for workload
with other tasks.

@code{.cpp}
 1: #include <taskflow/taskflow.hpp>
 2:
 3: int main() {
 4:
 5:   tf::Taskflow taskflow;
 6:
 7:   std::vector<int> items(1024);
 8:   std::atomic<int> sum {0};
 9:
10:   tf::Task T1 = taskflow.emplace([&] () {  // create a modifier task
11:     for(auto& item : items) {
12:       item = 1;
13:     }
14:   }).name("Create Items");
15:
16:   auto [S, T] = taskflow.parallel_for(items.begin(), items.end(), [&] (int item) {
17:     sum.fetch_add(item, std::memory_order_relaxed);
18:   }, 8);
19:
20:   tf::Task T2 = taskflow.emplace([&] () {  // create a output task
21:     std::cout << "sum is: " << sum << std::endl;
22:   }).name("Print Sum");
23:
24:   T1.precede(S);  // modifier precedes parallel-for
25:   T.precede(T2);  // parallel-for precedes the output task
26:
27:   tf::Executor().run(taskflow);
28:
29:   return 0;
30: }
@endcode


@image html images/parallel_for_example2.png width=100%


The output of this programs is:

@code{.sh}
sum is: 1024
@endcode

Debrief:
@li Line 5 creates a taskflow object
@li Line 7 creates a vector of 1024 uninitialized integers
@li Line 8 creates an atomic integer variable
@li Line 10-14 creates a task that captures the vector to initialize all items to one
@li Line 16-18 sums up all items with each thread running on a partition of 1024/8=128 items
@li Line 20-22 creates a task that outputs the summation value
@li Line 24-25 pipelines the parallel-for workload with the two tasks
@li Line 27 dispatches the graph to threads and blocks until the execution completes












Parallel tasks normally produce some quantity that needs to be combined or *reduced*
through particular operations, for instance, sum.
In this chapter, we are going to demonstrate how to use Cpp-Taskflow
to parallelize a reduction workload.


@section ReduceARangeOfItemsToASingleResult Reduce a Range of Items to a Single Result

tf::Taskflow::reduce(I beg, I end, T& result, B&& bop) 
is the most common reduction method to
create a task dependency graph that 
reduces a range of items to a single result through a binary operator.

@code{.cpp}
 1: tf::Taskflow taskflow;
 2:
 3: std::vector<int> items {1, 2, 3, 4, 5, 6, 7, 8};
 4: int sum {0};
 5:
 6: auto [S, T] = taskflow.reduce(items.begin(), items.end(), sum, [] (int a, int b) {
 7:   return a + b;
 8: });
 9:
10: S.name("S");
11: T.name("T");
12:
13: tf::Executor().run(taskflow).get();
14:
15: std::cout << "sum = " << sum << std::endl;    // 36
@endcode

Debrief:

@li Line 1 creates a taskflow object
@li Line 3 creates a vector of integers
@li Line 4 declares an integer variable @c sum and initializes it to zero
@li Line 6-8 constructs a reduction graph that sums up all integer itmes and stores the final result in @c sum
@li Line 10-11 names the two synchronization points
@li Line 13 dispatches the graph for execution
@li Line 15 prints out the final reduction value

The task dependency graph of this example is shown below:

@image html images/reduce1.png width=60%

Taskflow partitions and distributes the workload evenly across all workers
for all reduction methods.
In this example, each internal node sums up two integers and 
the target node @c T combine all results returned by the internal nodes to a single value.


@section TransformAndReduce Transform and Reduce

It is common to transform each item into a new data type and
then perform reduction on the transformed sequences.
Taskflow provides a method, 
tf::Taskflow::transform_reduce(I beg, I end, T& result, B&& bop, U&& uop), 
that fuses these two operators together to save memory reads and writes.
The example below takes a string and transforms each digit to an integer number,
and then applies reduction to sum up all integer numbers.

@code{.cpp}
 1: std::string str = "12345678";
 2: int sum {0};
 3:
 4: auto [S, T] = taskflow.transform_reduce(str.begin(), str.end(), sum,
 5:   [] (int a, int b) {
 6:     return a + b;
 7:   },  
 8:   [] (char c) -> int {
 9:     return c - '0';
10:   }   
11: ); 
12:
13: // sum will be 36 after execution
@endcode

Debrief:

@li Line 1 creates a string of eight digits
@li Line 2 declares an integer variables and initializes it to zero
@li Line 4-11 constructs a reduction graph that converts each character of the string into an integer and computes the sum of all integers

Taskflow has another overload 
tf::Taskflow::transform_reduce(I beg, I end, T& result, B&& bop1, P&& bop2, U&& uop) 
that takes an additional binary operator @c bop2 
to combine the result of @c uop and the dereferencing of an input iterator
to a data type that is acceptable as input to @c bop1.
This is useful when extra computation is required during the reduction process.

@code{.cpp}
 1: std::string str = "12345678";
 2:
 3: double sum {0};
 4:
 5: auto [S, T] = taskflow.transform_reduce(str.begin(), str.end(), sum,
 6:   [] (double a, double b) {
 7:     return a + b;
 8:   },  
 9:   [] (double a, char c) -> double {
10:     return a + (c - '0');
11:   },  
12:   [] (char c) -> double {
13:     return static_cast<double>(c - '0');
14:   }   
15: );  
16:
17: // sum will be 36 after execution
@endcode

Debrief:

@li Line 1 creates a string of eight digits
@li Line 3 declares an integer variable and initializes it to zero
@li Line 5 constructs a reduction graph that represents the reduction workload
@li Line 6-8 takes a binary operator to combine two transformed data
@li Line 9-11 takes a binary operator to combine one raw data together with a transformed data
@li Line 12-14 takes a unary operator to transform one raw data to the reduced data type 

The difference between the two overloads appears in the second binary operator.
Instead of converting every item to the reduced data type,
this binary operator provides a more fine-grained control over reduction.


@section Chapter4Example1 Example 1: Find the Min/Max Element

One common workload of using reduce is to find the minimum or the maximum
element in a range of items.
This example demonstrates how to use the method tf::Taskflow::reduce to find the 
minimum element in a data set.

@code{.cpp}
 1: #include <taskflow/taskflow.hpp>
 2:
 3: int main() {
 4:
 5:   tf::Taskflow taskflow;
 6:
 7:   std::vector<int> items {4, 2, 1, 3, 7, 8, 6, 5};
 8:   int min = std::numeric_limits<int>::max();
 9:
10:   taskflow.reduce(items.begin(), items.end(), min, [] (int a, int b) {
11:     return std::min(a, b);
12:   });
13:
14:   tf::Executor().run(taskflow).get();
15:
16:   std::cout << min << std::endl;  // 1
17:
18:   return 0;
19: }
@endcode

Similarly, the example below finds the maximum element in a data set.

@code{.cpp}
 1: #include <taskflow/taskflow.hpp>
 2:
 3: int main() {
 4:
 5:   tf::Taskflow taskflow;
 6:
 7:   std::vector<int> items {4, 2, 1, 3, 7, 8, 6, 5};
 8:   int max = std::numeric_limits<int>::min();
 9:
10:   taskflow.reduce(items.begin(), items.end(), max, [] (int a, int b) {
11:     return std::max(a, b);
12:   });
13:
14:   tf::Executor().run(taskflow).get();
15:
16:   std::cout << max << std::endl;  // 8
17:
18:   return 0;
19: }
@endcode


@section Chapter4Example2 Example 2: Pipeline a Reduction Graph

The tf::Taskflow::reduce method returns a task pair as two synchronization points 
of the reduction graph
which can be used to pipeline with other tasks.
The example below demonstrates how to pipeline a reduction graph with other tasks.

@code{.cpp}
 1: #include <taskflow/taskflow.hpp>
 2:
 3: int main() {
 4:
 5:   tf::Taskflow taskflow;
 6:
 7:   std::vector<int> items{1024};
 8:   int min = std::numeric_limits<int>::max();
 9:
10:   tf::Task T1 = taskflow.emplace([&] () {  // modifier task
11:     for(auto& item : items) {
12:       item = ::rand();
13:     }
14:   });
15:
16:   auto [S, T] = taskflow.reduce(items.begin(), items.end(), min, [] (int a, int b) {
17:     return std::min(a, b);
18:   });
19:
20:   tf::Task T2 = taskflow.emplace([&] () {  // printer task
21:     std::cout << "min is " << min << std::endl;
22:   });
23:
24:   T1.precede(S);    // modifier task precedes the reducer
25:   T.precede(T2);    // reducer precedes the printer task
26:
27:   tf::Executor().run(taskflow);
28:
29:   return 0;
30: }
@endcode

Debrief:
@li Line 5 creates a taskflow object with four worker threads
@li Line 7 creates a vector of 1024 uninitialized integers
@li Line 8 creates an integer value and initializes it to the maximum representable value of @c int
@li Line 10-14 creates a modifier task that initializes the vector to random integer values
@li Line 16-18 creates a reduction graph to find the minimum element in the vector
@li Line 20-22 creates a task that prints the minimum value found after the reduction finishes
@li Line 24-25 adds two dependency links to implement our control flow
@li Line 27 dispatches the task dependency graph to threads and waits until the execution completes

In the reduction graph,
each worker thread applies the give reduce operator
to a partition of 1024/4 = 512 items.
The final minimum value is stored in the variable @c min.
Since the variable @c min participates in the reduction process,
it is users' responsibility to initialize it to a proper value.

@section Chapter4Example3 Example 3: Find the Minimum L1-norm

The example below applies the method tf::Taskflow::transform_reduce
to find the minimum L1-norm out of a point set.

@code{.cpp}
 1: #include <taskflow/taskflow.hpp>
 2: 
 3: struct Point {
 4:   int x1 {::rand() % 10 - 5};   // random value
 5:   int x2 {::rand() % 10 - 5};   // random value
 6: };
 7: 
 8: int main() {
 9: 
10:   tf::Taskflow taskflow;
11: 
12:   std::vector<Point> points {1024};
13:   int min = std::numeric_limits<int>::max();
14: 
15:   taskflow.transform_reduce(points.begin(), points.end(), min,
16:     [] (int a, int b) {
17:       return std::min(a, b);
18:     },
19:     [] (const Point& point) -> int {    // Find the L1-norm of the point
20:       return std::abs(point.x1) + std::abs(point.x2);
21:     }
22:   );
23: 
24:   tf::Executor().run(taskflow).get();
25: 
26:   return 0;
27: }
@endcode

Debrief:
@li Line 3-6 declares a point struct at 2D plane
@li Line 10 creates a taskflow object with four worker threads
@li Line 12 creates a vector of 1024 randomly initialized points
@li Line 13 creates an integer value and initializes it to the maximum representable value of @c int
@li Line 15-22 applies transform-reduce operation to find the point with the minimum L1 distance
@li Line 24 dispatches the task dependency graph to threads and waits until the execution completes

