Matrix Multiplication
We study the classic problem of 2D matrix multiplication. We start with a short introduction to the problem and then discuss how to solve it using CPU and GPU parallel computing.
We are multiplying two matrices, A (MxK) and B (KxN). The number of columns of A must match the number of rows of B. The output matrix C has the shape (MxN), where M is the number of rows of A and N the number of columns of B. The following example multiplies a 3x3 matrix with a 3x2 matrix to derive a 3x2 matrix.
In general, to compute each element of C we iterate over a complete row of A and a complete column of B, multiplying the elements pairwise and summing the products.
We can implement matrix multiplication using three nested loops.
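A minimal sketch of this three-loop version, assuming A, B, and C are pre-allocated row-major integer matrices (the signature here is illustrative, not necessarily the exact code from the example):

```cpp
// Sequential matrix multiplication: C (MxN) = A (MxK) * B (KxN)
void matmul(int** A, int** B, int** C, int M, int K, int N) {
  for(int m = 0; m < M; m++) {      // each row of A
    for(int n = 0; n < N; n++) {    // each column of B
      C[m][n] = 0;
      for(int k = 0; k < K; k++) {  // dot product of row m and column n
        C[m][n] += A[m][k] * B[k][n];
      }
    }
  }
}
```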
At a fine-grained level, computing each element of C is independent of the others. Similarly, computing each row of C or each column of C is independent of one another. With task parallelism, we prefer a coarse-grained model in which each task performs a rather large computation to amortize the overhead of creating and scheduling tasks. In this case, we avoid creating intensive tasks that each work on only a single element; instead, we create one task per row of C to multiply one row of A by every column of B.
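A sketch of this row-wise decomposition, assuming the same A, B, C, M, K, and N as above:

```cpp
#include <taskflow/taskflow.hpp>

tf::Executor executor;
tf::Taskflow taskflow;

// one task per row of C: task m multiplies row m of A by every column of B
for(int m = 0; m < M; m++) {
  taskflow.emplace([&, m](){
    for(int n = 0; n < N; n++) {
      C[m][n] = 0;
      for(int k = 0; k < K; k++) {
        C[m][n] += A[m][k] * B[k][n];
      }
    }
  });
}

executor.run(taskflow).wait();  // the row tasks are independent and run in parallel
```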
Instead of creating tasks one by one over a loop, you can leverage tf::Taskflow::parallel_for to create a parallel-for task. A parallel-for task spawns a subflow to perform parallel iterations over the given range.
Please visit A0: Parallel Iterations for more details.
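A sketch of the parallel-for version, assuming the index-based overload taskflow.parallel_for(first, last, step, callable); see the chapter above for the exact overloads available in your version:

```cpp
// iterate m = 0, 1, ..., M-1 in parallel, one row of C per iteration
taskflow.parallel_for(0, M, 1, [&](int m){
  for(int n = 0; n < N; n++) {
    C[m][n] = 0;
    for(int k = 0; k < K; k++) {
      C[m][n] += A[m][k] * B[k][n];
    }
  }
});
```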
A GPU can perform many more parallel computations than a CPU. It is especially useful for data-intensive computing such as matrix multiplication. With a GPU, we express the parallel patterns at a fine-grained level. The kernel, written in CUDA, is described as follows:
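A sketch of such a kernel, in which each thread computes a single element of C; it relies on the 1D row-major layout explained next:

```cpp
// each thread computes one element c[row][col] in the 1D row-major layout
__global__ void matmul(int* a, int* b, int* c, int M, int K, int N) {
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  int col = blockIdx.x * blockDim.x + threadIdx.x;
  if(row < M && col < N) {
    int sum = 0;
    for(int k = 0; k < K; k++) {
      sum += a[row * K + k] * b[k * N + col];  // a[row][k] * b[k][col]
    }
    c[row * N + col] = sum;  // c[row][col]
  }
}
```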
Each CUDA thread corresponds to an element of C and computes its result. Instead of storing each matrix in a 2D array, we use a 1D layout to ease the data transfer between CPU and GPU. In a row-major layout, an element (x, y) in the 2D matrix can be addressed at x * width + y in the transformed 1D layout.
The next step is to allocate memory for A, B, and C on a GPU. We create three tasks, each calling cudaMalloc to allocate space for one matrix. Then, we create a cudaFlow to offload matrix multiplication to a GPU. The entire code is described as follows:
Within the cudaFlow, we create two host-to-device (H2D) tasks that copy data from A and B to da and db, one device-to-host (D2H) task that copies the result from dc to C, and one kernel task that launches matmul on the GPU (by default, GPU 0). The H2D tasks precede the kernel, and the kernel precedes the D2H task. These GPU operations form a GPU task graph managed by a cudaFlow. The first dump of the taskflow gives the following graph:
A cudaFlow encapsulates a GPU task dependency graph similar to a subflow (see C3: Dynamic Tasking). In order to visualize it, we need to execute the graph first and then dump the taskflow.
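A minimal sketch of that order, assuming the executor and taskflow objects from the code above:

```cpp
executor.run(taskflow).wait();  // execute first so the GPU task graph is constructed
taskflow.dump(std::cout);       // then dump the taskflow in GraphViz (DOT) format
```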
We run three versions of matrix multiplication, sequential CPU, parallel CPUs, and one GPU, on a machine with a 6-core Intel i7-8700 CPU at 3.20 GHz and an NVIDIA RTX 2080 GPU, using various matrix sizes of A, B, and C.
A | B | C | CPU Sequential | CPU Parallel | GPU Parallel
---|---|---|---|---|---
10x10 | 10x10 | 10x10 | 0.142 ms | 0.414 ms | 82 ms
100x100 | 100x100 | 100x100 | 1.641 ms | 0.733 ms | 83 ms
1000x1000 | 1000x1000 | 1000x1000 | 1532 ms | 504 ms | 85 ms
2000x2000 | 2000x2000 | 2000x2000 | 25688 ms | 4387 ms | 133 ms
3000x3000 | 3000x3000 | 3000x3000 | 104838 ms | 16170 ms | 214 ms
4000x4000 | 4000x4000 | 4000x4000 | 250133 ms | 39646 ms | 427 ms
Once the matrix size reaches 1000x1000, the speed-up of the GPU over the CPUs becomes prominent.