Learning from Examples » Matrix Multiplication

We study the classic problem of 2D matrix multiplication. We start with a short introduction to the problem and then discuss how to solve it using CPU and GPU parallel computing.

Problem Formulation

We multiply two matrices, A (MxK) and B (KxN). The number of columns of A must match the number of rows of B. The output matrix C has the shape (MxN), where M is the number of rows of A and N is the number of columns of B. The following example multiplies a 3x3 matrix with a 3x2 matrix to produce a 3x2 matrix.

(Figure: multiplying a 3x3 matrix by a 3x2 matrix to produce a 3x2 matrix.)

In general, to compute each element of C we iterate over a complete row of A and a complete column of B, multiplying the corresponding elements and accumulating the sum.

(Figure: computing one element of C by pairing a row of A with a column of B.)
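In symbols, each element of C is the inner product of a row of A and a column of B:

C[m][n] = Σ_{k=0}^{K-1} A[m][k] * B[k][n],  for 0 ≤ m < M and 0 ≤ n < N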

We can implement matrix multiplication using three nested loops.

for(int m=0; m<M; m++) {
  for(int n=0; n<N; n++) {
    C[m][n] = 0;
    for(int k=0; k<K; k++) {
      C[m][n] += A[m][k] * B[k][n];
    }
  }
}
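As a quick sanity check, the following self-contained sketch runs the triple loop on a 3x3 matrix and a 3x2 matrix (the concrete values are illustrative, not taken from the figure above):

#include <cstdio>

int main() {
  const int M = 3, K = 3, N = 2;
  int A[M][K] = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}};  // illustrative values
  int B[K][N] = {{1, 2}, {3, 4}, {5, 6}};           // illustrative values
  int C[M][N];

  // the same three nested loops as above
  for(int m=0; m<M; m++) {
    for(int n=0; n<N; n++) {
      C[m][n] = 0;
      for(int k=0; k<K; k++) {
        C[m][n] += A[m][k] * B[k][n];
      }
    }
  }

  // prints the 3x2 result (22 28 / 49 64 / 76 100), row by row
  for(int m=0; m<M; m++) {
    for(int n=0; n<N; n++) {
      std::printf("%d ", C[m][n]);
    }
    std::printf("\n");
  }
}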

Parallel Patterns

At a fine-grained level, the computation of each element of C is independent of the others. Similarly, the computations of individual rows or columns of C are independent of one another. With task parallelism, we prefer a coarse-grained model in which each task performs a reasonably large computation to amortize the overhead of creating and scheduling tasks. In this case, we avoid creating a task per element; instead, we create a task per row of C to multiply one row of A by every column of B.

// C = A * B
// A is a MxK matrix, B is a KxN matrix, and C is a MxN matrix
void matrix_multiplication(int** A, int** B, int** C, int M, int K, int N) {

  tf::Taskflow taskflow;
  tf::Executor executor;

  // create one task per row of C; each task captures its row index m by value
  for(int m=0; m<M; ++m) {
    taskflow.emplace([&, m] () {
      for(int n=0; n<N; n++) {
        C[m][n] = 0;
        for(int k=0; k<K; k++) {
          C[m][n] += A[m][k] * B[k][n];
        }
      }
    });
  }

  executor.run(taskflow).wait();
}

Instead of creating tasks one-by-one over a loop, you can leverage tf::Taskflow::for_each_index to create a parallel-for task. A parallel-for task spawns a subflow to perform parallel iterations over the given range.

// perform parallel iterations on the range [0, M) with the step size of 1
tf::Task task = taskflow.for_each_index(0, M, 1, [&] (int m) {
  for(int n=0; n<N; n++) {
    C[m][n] = 0;
    for(int k=0; k<K; k++) {
      C[m][n] += A[m][k] * B[k][n];
    }
  }
});
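For reference, a minimal complete version of the function built around this parallel-for task might look as follows (a sketch that mirrors the int** signature of the task-per-row version above):

void matrix_multiplication(int** A, int** B, int** C, int M, int K, int N) {

  tf::Taskflow taskflow;
  tf::Executor executor;

  // one parallel-for task that covers all M rows of C
  taskflow.for_each_index(0, M, 1, [&] (int m) {
    for(int n=0; n<N; n++) {
      C[m][n] = 0;
      for(int k=0; k<K; k++) {
        C[m][n] += A[m][k] * B[k][n];
      }
    }
  });

  executor.run(taskflow).wait();
}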

Please visit Parallel Iterations for more details.

GPU-based Acceleration

A GPU can perform many more parallel computations than a CPU, which is especially useful for data-intensive workloads such as matrix multiplication. With a GPU, we express the parallel pattern at a fine-grained level. The kernel, written in CUDA, is described as follows:

// CUDA kernel to perform matrix multiplication
__global__ void matmul(int *A, int *B, int *C, int M, int K, int N) {
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  int col = blockIdx.x * blockDim.x + threadIdx.x;
  int sum = 0;
  if(col < N && row < M) {
    for(int i = 0; i < K; i++) {
      sum += a[row * K + i] * b[i * N + col];
    }
    c[row * N + col] = sum;
  }
}

Each CUDA thread corresponds to one element of C and computes its result. Instead of storing each matrix in a 2D array, we use a 1D layout to ease the data transfer between CPU and GPU. In a row-major layout, an element (x, y) in the 2D matrix can be addressed at x * width + y in the transformed 1D layout.

(Figure: row-major mapping of a 2D matrix onto a 1D array.)
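A small helper makes this mapping explicit (at is an illustrative name, not part of the example code):

// row-major mapping: element (x, y) of a 2D matrix with the given
// width (number of columns) lives at index x * width + y in 1D storage
inline int at(int x, int y, int width) {
  return x * width + y;
}

// e.g., A[m][k] of the MxK matrix A becomes A[at(m, k, K)],
// and B[k][n] of the KxN matrix B becomes B[at(k, n, N)]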

The next step is to allocate memory for A, B, and C on the GPU. We create three tasks, each calling cudaMalloc to allocate space for one matrix. Then, we create a cudaFlow to offload the matrix multiplication to the GPU. The entire code is described as follows:

void matrix_multiplication(int* A, int* B, int* C, int M, int K, int N) {

  tf::Taskflow taskflow;
  tf::Executor executor;

  // device storage for A, B, and C
  int *da {nullptr};
  int *db {nullptr};
  int *dc {nullptr};

  // allocate the gpu storage for A
  tf::Task allocate_a = taskflow.emplace([&](){
    cudaMalloc(&da, M*K*sizeof(int));
  }).name("allocate_a");

  // allocate the gpu storage for B
  tf::Task allocate_b = taskflow.emplace([&](){
    cudaMalloc(&db, K*N*sizeof(int));
  }).name("allocate_b");

  // allocate the gpu storage for C
  tf::Task allocate_c = taskflow.emplace([&](){
    cudaMalloc(&dc, M*N*sizeof(int));
  }).name("allocate_c");

  // create a cudaFlow to run the matrix multiplication
  tf::Task cudaFlow = taskflow.emplace([&](tf::cudaFlow& cf){

    // copy A and B to da and db, and copy the result from dc back to C
    tf::cudaTask copy_da = cf.copy(da, A, M*K).name("H2D_A");
    tf::cudaTask copy_db = cf.copy(db, B, K*N).name("H2D_B");
    tf::cudaTask copy_hc = cf.copy(C, dc, M*N).name("D2H_C");

    // launch one 16x16 block of threads per 16x16 tile of C
    dim3 grid  ((N+16-1)/16, (M+16-1)/16);
    dim3 block (16, 16);

    tf::cudaTask kmatmul = cf.kernel(grid, block, 0, matmul, da, db, dc, M, K, N)
                             .name("matmul");

    kmatmul.succeed(copy_da, copy_db)
           .precede(copy_hc);

  }).name("cudaFlow");

  // free the gpu storage
  auto free = taskflow.emplace([&](){
    cudaFree(da);
    cudaFree(db);
    cudaFree(dc);
  }).name("free");

  // create dependency
  cudaFlow.succeed(allocate_a, allocate_b, allocate_c)
          .precede(free);

  // dump the graph without unfolding the cudaFlow
  taskflow.dump(std::cout);

  // run the taskflow
  executor.run(taskflow).wait();

  // dump the entire execution graph including unfolded cudaFlow
  taskflow.dump(std::cout);
}

Within the cudaFlow, we create two host-to-device (H2D) tasks that copy data from A and B to da and db, one device-to-host (D2H) task that copies the result from dc back to C, and one kernel task that launches matmul on the GPU (by default, GPU 0). The two H2D tasks precede the kernel, and the kernel precedes the D2H task. These GPU operations form a GPU task graph managed by the cudaFlow. The first dump of the taskflow gives the following graph:

(Graph: allocate_a, allocate_b, and allocate_c each precede cudaFlow, which precedes free.)

A cudaFlow encapsulates a GPU task dependency graph similar to a subflow (see Dynamic Tasking). In order to visualize it, we need to execute the graph first and then dump the taskflow.

(Graph: the same top-level structure as before, with the cudaFlow now unfolded to show H2D_A and H2D_B preceding matmul, which precedes D2H_C.)

Benchmarking

We run three versions of matrix multiplication, sequential CPU, parallel CPUs, and one GPU, on a machine with 6 Intel i7-8700 CPU cores at 3.20 GHz and an NVIDIA RTX 2080 GPU, using various sizes of A, B, and C.

A          B          C          CPU Sequential   CPU Parallel   GPU Parallel
10x10      10x10      10x10      0.142 ms         0.414 ms       82 ms
100x100    100x100    100x100    1.641 ms         0.733 ms       83 ms
1000x1000  1000x1000  1000x1000  1532 ms          504 ms         85 ms
2000x2000  2000x2000  2000x2000  25688 ms         4387 ms        133 ms
3000x3000  3000x3000  3000x3000  104838 ms        16170 ms       214 ms
4000x4000  4000x4000  4000x4000  250133 ms        39646 ms       427 ms

Once the matrix size reaches 1000x1000 and beyond, the speed-up of the GPU over the CPUs becomes prominent.
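The absolute numbers depend on the machine, but a comparison of this kind can be reproduced with a simple std::chrono harness along the following lines (measure_ms is a hypothetical helper, not part of the benchmark code):

#include <chrono>

// measures one invocation of a callable and returns elapsed milliseconds
template <typename F>
long long measure_ms(F&& f) {
  auto beg = std::chrono::steady_clock::now();
  f();
  auto end = std::chrono::steady_clock::now();
  return std::chrono::duration_cast<std::chrono::milliseconds>(end - beg).count();
}

// usage: time each version with the same inputs, e.g.,
// auto t = measure_ms([&](){ matrix_multiplication(A, B, C, M, K, N); });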