Matrix Multiplication

We study the classic problem of 2D matrix multiplication. We start with a short introduction to the problem and then discuss how to solve it with CPU and GPU parallel computing.

Problem Formulation

We are multiplying two matrices, A (MxK) and B (KxN). The number of columns of A must match the number of rows of B. The output matrix C has the shape MxN, where M is the number of rows of A and N is the number of columns of B. The following example multiplies a 3x3 matrix with a 3x2 matrix to derive a 3x2 matrix.

[Figure: matrix_multiplication_1.png]

In general, to compute each element of C we iterate over a complete row of A and a complete column of B, multiplying the corresponding elements and summing the products.

[Figure: matrix_multiplication_2.png]

We can implement matrix multiplication using three nested loops.

for(int m=0; m<M; m++) {
  for(int n=0; n<N; n++) {
    C[m][n] = 0;
    for(int k=0; k<K; k++) {
      C[m][n] += A[m][k] * B[k][n];
    }
  }
}

Parallel Patterns

At a fine-grained level, computing each element of C is independent of the others. Similarly, computing each row or each column of C is independent of the others. With task parallelism, we prefer a coarse-grained model in which each task performs a reasonably large amount of computation to amortize the overhead of creating and scheduling tasks. We therefore avoid creating a huge number of tiny tasks that each work on only a single element. Instead, we create one task per row of C to multiply a row of A by every column of B.

// C = A * B
// A is a MxK matrix, B is a KxN matrix, and C is a MxN matrix
void matrix_multiplication(int** A, int** B, int** C, int M, int K, int N) {
  tf::Taskflow taskflow;
  tf::Executor executor;
  // loop over the range [0, M) with a step size of one
  taskflow.parallel_for(0, M, 1, [&] (int m) {
    for(int n=0; n<N; n++) {
      C[m][n] = 0;
      for(int k=0; k<K; k++) {
        C[m][n] += A[m][k] * B[k][n];
      }
    }
  });
  executor.run(taskflow).wait();
}
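As a usage sketch (the alloc2d helper and the matrix sizes below are hypothetical, only for illustration), one way to drive this routine:

// allocate a rows x cols matrix of zero-initialized ints (hypothetical helper)
auto alloc2d = [](int rows, int cols) {
  int** ptr = new int*[rows];
  for(int r=0; r<rows; ++r) {
    ptr[r] = new int[cols]();   // value-initialized to zero
  }
  return ptr;
};
int M = 1000, K = 1000, N = 1000;
int** A = alloc2d(M, K);
int** B = alloc2d(K, N);
int** C = alloc2d(M, N);
// ... fill A and B with input data ...
matrix_multiplication(A, B, C, M, K, N);
// ... consume C, then delete[] the rows and the row pointers ...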

The method Taskflow::parallel_for creates a parallel-for graph. The above example creates one task per row m of C and applies a lambda to compute that row. The first three arguments define the loop over the range [0, M) with a step size of one. You can force each task to work on a chunk of iterations by passing an additional chunk-size argument at the end.

unsigned P = 8;  // eight partitioned tasks
// loop over the range [0, M) with a step size of one and have each task run a chunk of iterations
std::pair<tf::Task, tf::Task> pair = taskflow.parallel_for(0, M, 1, [&] (int m) {
  for(int n=0; n<N; n++) {
    C[m][n] = 0;
    for(int k=0; k<K; k++) {
      C[m][n] += A[m][k] * B[k][n];
    }
  }
}, (M + P - 1) / P);
// name the tasks
pair.first.name("S");
pair.second.name("T");
pair.first.for_each_successor([i=0] (tf::Task task) mutable {
  task.name(std::string("chunk") + std::to_string(i++));
});
// dump the taskflow in DOT format
taskflow.dump(std::cout);
[Figure: matrix_multiplication_3.svg]

The above code partitions the loop into eight tasks, each containing up to (M + P - 1) / P rows of C. The return value of Taskflow::parallel_for is a std::pair of tasks that mark the starting point and the ending point of the parallel-for graph. You can use this pair of tasks to splice the parallel-for graph between other dependent tasks.
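For example, a minimal sketch (the init and post tasks below are hypothetical and not part of the original example) that runs an initialization task before the parallel-for graph and a post-processing task after it:

// zero C before any chunk task starts (hypothetical init task)
tf::Task init = taskflow.emplace([&](){
  for(int m=0; m<M; m++) {
    for(int n=0; n<N; n++) {
      C[m][n] = 0;
    }
  }
}).name("init");
// run after every chunk task has finished (hypothetical post task)
tf::Task post = taskflow.emplace([](){ std::cout << "C is ready\n"; }).name("post");
init.precede(pair.first);    // init -> S -> chunk tasks
pair.second.precede(post);   // chunk tasks -> T -> post

Instead of using Taskflow::parallel_for, you can also create the parallel-for graph manually: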

// create empty tasks S and T to mark the start and the end of the parallel-for graph
tf::Task S = taskflow.emplace([](){}).name("S");
tf::Task T = taskflow.emplace([](){}).name("T");
for(int m=0; m<M; ++m) {
  tf::Task task = taskflow.emplace([&, m] () {
    for(int n=0; n<N; n++) {
      C[m][n] = 0;
      for(int k=0; k<K; k++) {
        C[m][n] += A[m][k] * B[k][n];
      }
    }
  });
  task.succeed(S)
      .precede(T);
}

Depending on the application, the chunk size of each task can have a significant impact on performance. A smaller chunk size implies more tasks, so the overhead of creating and scheduling tasks may slow the program down. A larger chunk size implies fewer tasks, but the work may be distributed unevenly, so load balancing between tasks dominates the performance.
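For instance, a minimal sketch of one common heuristic, deriving the chunk size from the executor's worker count via tf::Executor::num_workers() (the choice of roughly one chunk per worker is an assumption, not a rule):

size_t W = executor.num_workers();   // number of worker threads in the executor
int chunk = (M + static_cast<int>(W) - 1) / static_cast<int>(W);
taskflow.parallel_for(0, M, 1, [&] (int m) {
  for(int n=0; n<N; n++) {
    C[m][n] = 0;
    for(int k=0; k<K; k++) {
      C[m][n] += A[m][k] * B[k][n];
    }
  }
}, chunk);   // roughly one chunk of rows per worker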

GPU-based Acceleration

A GPU can perform far more parallel computation than CPUs, which makes it especially useful for data-intensive workloads such as matrix multiplication. With a GPU, we express the parallel pattern at a fine-grained level. The kernel, written in CUDA, is described as follows:

// CUDA kernel to perform matrix multiplication
__global__ void matmul(int *A, int *B, int *C, int M, int K, int N) {
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  int col = blockIdx.x * blockDim.x + threadIdx.x;
  int sum = 0;
  if(col < N && row < M) {
    for(int i = 0; i < K; i++) {
      sum += A[row * K + i] * B[i * N + col];
    }
    C[row * N + col] = sum;
  }
}
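For reference only, a minimal sketch of launching this kernel with plain CUDA calls outside Taskflow (the matrices are stored in the flattened 1D layout described below; error checking is omitted and the 16x16 block size is an assumption):

int *da, *db, *dc;
cudaMalloc(&da, M*K*sizeof(int));
cudaMalloc(&db, K*N*sizeof(int));
cudaMalloc(&dc, M*N*sizeof(int));
cudaMemcpy(da, A, M*K*sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(db, B, K*N*sizeof(int), cudaMemcpyHostToDevice);
dim3 block(16, 16);
dim3 grid((N+block.x-1)/block.x, (M+block.y-1)/block.y);
matmul<<<grid, block>>>(da, db, dc, M, K, N);
cudaMemcpy(C, dc, M*N*sizeof(int), cudaMemcpyDeviceToHost);
cudaFree(da); cudaFree(db); cudaFree(dc);

The cudaFlow described later in this section expresses these same operations as a task graph.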

Each CUDA thread corresponds to one element of C and computes its result. Instead of storing each matrix in a 2D array, we use a 1D layout to ease the data transfer between CPU and GPU. In a row-major layout, an element (x, y) of the 2D matrix is addressed at x * width + y in the flattened 1D layout.

[Figure: matrix_multiplication_4.png]
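As a small illustration (the helper name idx is hypothetical), the flattening works as follows:

// map a 2D index (x, y) of a matrix with the given width to its 1D row-major offset
inline int idx(int x, int y, int width) { return x * width + y; }

// A is MxK, B is KxN, and C is MxN in the flattened layout:
//   A[m][k]  becomes  A[idx(m, k, K)]
//   B[k][n]  becomes  B[idx(k, n, N)]
//   C[m][n]  becomes  C[idx(m, n, N)]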

The next step is to allocate memory for A, B, and C on the GPU. We create three tasks, each calling cudaMalloc to allocate space for one matrix. Then, we create a cudaFlow to offload the matrix multiplication to the GPU. The entire code is described as follows:

void matrix_multiplication(int* A, int* B, int* C, int M, int K, int N) {
  tf::Taskflow taskflow;
  tf::Executor executor;
  int *da, *db, *dc;  // device storage for A, B, and C
  // allocate the gpu storage for A
  tf::Task allocate_a = taskflow.emplace([&](){
    cudaMalloc(&da, M*K*sizeof(int));
  }).name("allocate_a");
  // allocate the gpu storage for B
  tf::Task allocate_b = taskflow.emplace([&](){
    cudaMalloc(&db, K*N*sizeof(int));
  }).name("allocate_b");
  // allocate the gpu storage for C
  tf::Task allocate_c = taskflow.emplace([&](){
    cudaMalloc(&dc, M*N*sizeof(int));
  }).name("allocate_c");
  // create a cudaFlow to run the matrix multiplication
  tf::Task cudaFlow = taskflow.emplace([&](tf::cudaFlow& cf){
    // copy A and B to the gpu, and copy the result from dc back to C
    tf::cudaTask copy_da = cf.copy(da, A, M*K).name("H2D_A");
    tf::cudaTask copy_db = cf.copy(db, B, K*N).name("H2D_B");
    tf::cudaTask copy_hc = cf.copy(C, dc, M*N).name("D2H_C");
    dim3 grid ((N+16-1)/16, (M+16-1)/16);
    dim3 block (16, 16);
    tf::cudaTask kmatmul = cf.kernel(grid, block, 0, matmul, da, db, dc, M, K, N)
                             .name("matmul");
    kmatmul.succeed(copy_da, copy_db)
           .precede(copy_hc);
  }).name("cudaFlow");
  // free the gpu storage
  auto free = taskflow.emplace([&](){
    cudaFree(da);
    cudaFree(db);
    cudaFree(dc);
  }).name("free");
  // create the dependencies
  cudaFlow.succeed(allocate_a, allocate_b, allocate_c)
          .precede(free);
  // dump the graph without unfolding the cudaFlow
  taskflow.dump(std::cout);
  // run the taskflow
  executor.run(taskflow).wait();
  // dump the entire execution graph, including the unfolded cudaFlow
  taskflow.dump(std::cout);
}

Within the cudaFlow, we create two host-to-device (H2D) tasks that copy data from A and B to da and db, one device-to-host (D2H) task that copies the result from dc back to C, and one kernel task that launches matmul on the GPU (by default, GPU 0). The H2D tasks precede the kernel, and the kernel precedes the D2H task. These GPU operations form a GPU task graph managed by the cudaFlow. The first dump of the taskflow gives the following graph:

[Figure: matrix_multiplication_5.svg]

A cudaFlow encapsulates a GPU task dependency graph similar to a subflow (see C3: Dynamic Tasking). In order to visualize it, we need to execute the graph first and then dump the taskflow.

[Figure: matrix_multiplication_6.svg]
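To sanity-check the GPU routine, a minimal sketch (hypothetical; it fills small matrices with ones and assumes the flattened 1D layout described earlier) can compare the output against a sequential CPU reference:

#include <cassert>
#include <vector>

int main() {
  int M = 64, K = 32, N = 48;
  std::vector<int> A(M*K, 1), B(K*N, 1), C(M*N, 0), ref(M*N, 0);
  // GPU version shown above
  matrix_multiplication(A.data(), B.data(), C.data(), M, K, N);
  // sequential CPU reference
  for(int m=0; m<M; m++) {
    for(int n=0; n<N; n++) {
      for(int k=0; k<K; k++) {
        ref[m*N + n] += A[m*K + k] * B[k*N + n];
      }
    }
  }
  assert(C == ref);   // every element should equal K
}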

Benchmarking

We run three versions of matrix multiplication, a sequential CPU version, a parallel CPU version, and a GPU version, on a machine with 6 Intel i7-8700 CPUs at 3.20 GHz and an NVIDIA RTX 2080 GPU, using various sizes of A, B, and C.

A          B          C          CPU Sequential   CPU Parallel   GPU Parallel
10x10      10x10      10x10      0.142 ms         0.414 ms       82 ms
100x100    100x100    100x100    1.641 ms         0.733 ms       83 ms
1000x1000  1000x1000  1000x1000  1532 ms          504 ms         85 ms
2000x2000  2000x2000  2000x2000  25688 ms         4387 ms        133 ms
3000x3000  3000x3000  3000x3000  104838 ms        16170 ms       214 ms
4000x4000  4000x4000  4000x4000  250133 ms        39646 ms       427 ms

As the matrix size grows beyond 1000x1000, the speed-up of the GPU over the CPUs becomes prominent.
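For reference, a minimal timing sketch (not the exact harness used to produce the table above) that wraps one run in std::chrono, assuming <chrono> and <iostream> are included:

auto beg = std::chrono::steady_clock::now();
executor.run(taskflow).wait();   // run the version being measured
auto end = std::chrono::steady_clock::now();
std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(end - beg).count()
          << " ms\n";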