Linear Algebra with cuBLAS

Taskflow provides a library tf::cublasFlowCapturer to program and accelerate basic linear algebra subprograms (BLAS) on top of cuBLAS.


What is BLAS?

The BLAS (Basic Linear Algebra Subprograms) are routines that provide standard building blocks for performing basic vector and matrix operations. There are three levels:

  1. Level 1: performs scalar, vector, and vector-vector operations
  2. Level 2: performs matrix-vector operations
  3. Level 3: performs matrix-matrix operations

BLAS is commonly used by linear algebra software. The cuBLAS library is an implementation of BLAS (Basic Linear Algebra Subprograms) on top of the Nvidia CUDA runtime and it allows users to access the computational resources of Nvidia GPUs.

What is a cublasFlow Capturer? Why?

tf::cublasFlowCapturer provides an interface over native cuBLAS functions and allows users to express linear algebra algorithms using a task graph model. We transform the task graph into a CUDA graph using a stream capture algorithm optimized for maximum concurrency. Once a cuBLAS program is transformed into a CUDA graph, we can launch the entire graph with a single call. This organization minimizes kernel launch overhead and allows the CUDA runtime to optimize the whole workflow. The following example (cublasflow.cu) uses tf::cublasFlowCapturer to compute the 2-norm of a vector.

#include <taskflow/cublasflow.hpp>

int main() {

  const int N = 1024;
  
  tf::Executor executor;
  tf::Taskflow taskflow("cublas 2-norm");  // initialize an unit vector
  
  std::vector<float> hvec(N, 1); 
  float  hres;                                     // cpu vector
  float* gvec = tf::cuda_malloc_device<float>(N);  // gpu vector
  float* gres = tf::cuda_malloc_device<float>(1);  // gpu result
  
  taskflow.emplace([&](tf::cudaFlowCapturer& capturer){
    // create a cuBLAS flow capturer
    auto blas = capturer.make_capturer<tf::cublasFlowCapturer>();
    
    tf::cudaTask h2d = capturer.copy(gvec, hvec.data(), N).name("h2d");
    tf::cudaTask nrm = blas->nrm2(N, gvec, 1, gres).name("2-norm");
    tf::cudaTask d2h = capturer.copy(&hres, gres, 1).name("d2h");
  
    nrm.precede(d2h)
       .succeed(h2d);
  }).name("capturer");
  
  executor.run(taskflow).wait();
  
  taskflow.dump(std::cout);
  
  assert(hres == 32);
}
[Taskflow diagram of "cublas 2-norm": inside the cudaFlow capturer "capturer", h2d precedes 2-norm, which precedes d2h.]
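The assertion at the end holds because the 2-norm of a vector of 1024 ones is:

\[ \|hvec\|_2 = \sqrt{\sum_{i=1}^{1024} 1^2} = \sqrt{1024} = 32 \]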

You need to link the cuBLAS library when compiling a cublasFlow capturer program:

~$ nvcc cublasflow.cu -I path/to/taskflow/include -lcublas

Please refer to the page Compile Taskflow with CUDA for more details about compiling Taskflow with CUDA.

Understand the Data Model

The data pointers used within tf::cublasFlowCapturer must sit in GPU memory space, including scalar pointers (alpha and beta), input pointers (e.g., vectors, matrices), and output pointers (e.g., result). By default, we set the pointer mode to CUBLAS_POINTER_MODE_DEVICE. You must allocate the required matrices and vectors in GPU memory space, fill them with data, call the methods defined in tf::cublasFlowCapturer, and then copy the results from GPU memory back to the host.
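For instance, because the pointer mode is CUBLAS_POINTER_MODE_DEVICE, even the scalars alpha and beta live in GPU memory. A minimal sketch (variable and task names are illustrative; the same pattern appears in the GEMM example later in this page) allocates them with tf::cuda_malloc_device and initializes them with a small kernel task inside the capturer:

float* dAlpha = tf::cuda_malloc_device<float>(1);  // scalar alpha in GPU memory
float* dBeta  = tf::cuda_malloc_device<float>(1);  // scalar beta in GPU memory

taskflow.emplace([&](tf::cudaFlowCapturer& capturer){
  auto blas = capturer.make_capturer<tf::cublasFlowCapturer>();

  // set alpha = 1 and beta = 0 on the device
  tf::cudaTask init = blas->single_task([=] __device__ () {
    *dAlpha = 1.0f;
    *dBeta  = 0.0f;
  }).name("init scalars");

  // cuBLAS tasks that read dAlpha/dBeta must be ordered after init,
  // e.g., some_blas_task.succeed(init);
}).name("capturer");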

The cuBLAS library uses column-major storage and 1-based indexing. Since C/C++ adopts row-major layout, we cannot use the native array semantics when matrix-matrix or matrix-vector operations are involved. We often need extra transposition on input matrices before these operations can take place correctly. In terms of storage, a row-major matrix is equivalent to a transposed column-major matrix, as shown below:

\[ A_{RowMajor} \iff A^T_{ColumnMajor} \]

Example: Transform Data Layout in Matrix Multiplication

Suppose we have a method matmul(A, B, C) that multiplies two matrices A and B and stores the result in C, using column-major storage. In C/C++, data layout is mostly row-major. Since we know a row-major matrix is equivalent in storage to a transposed column-major matrix, we can take a transposed view of this multiplication:

\[ C^T = B^T \times A^T \]

If the given matrices A, B, and C are in row-major layout, their buffers are exactly the column-major storage of A^T, B^T, and C^T. Calling matmul(B, A, C) on these buffers therefore computes the transposed version above: the function stores C^T in column-major storage, which in turn is the row-major layout of C, our desired solution.
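As a concrete illustration of this mapping, the sketch below uses the raw cublasSgemm call with host-side scalars, outside a capturer, and an illustrative helper name. It computes a row-major product C = A x B, with A of size M x K, B of size K x N, and C of size M x N, by passing the operands and dimensions of the transposed problem to the column-major routine:

#include <cublas_v2.h>

// Computes the row-major C = A * B by calling the column-major cublasSgemm
// on the transposed view C^T = B^T * A^T.
// dA (M x K), dB (K x N), and dC (M x N) are device pointers to row-major data.
void row_major_sgemm(cublasHandle_t handle, int M, int N, int K,
                     const float* dA, const float* dB, float* dC) {
  const float alpha = 1.0f, beta = 0.0f;  // host scalars (default pointer mode)
  cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
              N, M, K,        // dimensions of the transposed problem
              &alpha,
              dB, N,          // B^T is N x K with leading dimension N
              dA, K,          // A^T is K x M with leading dimension K
              &beta,
              dC, N);         // C^T is N x M with leading dimension N
}

The c_* methods of tf::cublasFlowCapturer (e.g., c_trsv and c_gemm used in the examples below) effectively perform this transformation for you, so you can pass row-major data directly.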

Use Level-1 Methods

We currently support the following level-1 methods:

Our level-1 methods capture the native cuBLAS level-1 calls with internal stream(s) optimized for maximum concurrency. Any scalar pointers (e.g., alpha) and the input and output vectors must sit in GPU memory space.

Use Level-2 Methods

We currently support the following level-2 methods:

Our level-2 methods capture the native cuBLAS level-2 calls with internal stream(s) optimized for maximum concurrency. The two scalars, alpha and beta, and the input and output matrices and vectors must sit in GPU memory space.

Example: Solve a Triangular Linear System

The following program solves a triangular linear system on row-major layout using tf::cublasFlowCapturer::c_trsv:

#include <taskflow/cublasflow.hpp>

int main() {

  int N = 3;

  const std::vector<float> hA = {
    1, 0, 0,  // x1
    1, 1, 0,  // x1 + x2
    1, 1, 1   // x1 + x2 + x3
  };

  const std::vector<float> hB = {
    5,        // x1           = 5
    4,        // x1 + x2      = 4
    7         // x1 + x2 + x3 = 7
  };

  std::vector<float> r(N, 0);

  tf::Taskflow taskflow("Ax = b");
  tf::Executor executor;

  float* dA = tf::cuda_malloc_device<float>(hA.size());
  float* dB = tf::cuda_malloc_device<float>(hB.size());

  taskflow.emplace([&](tf::cudaFlowCapturer& capturer){
    auto blas = capturer.make_capturer<tf::cublasFlowCapturer>();
    tf::cudaTask h2dA = blas->copy(dA, hA.data(), hA.size()).name("copy A");
    tf::cudaTask h2dB = blas->copy(dB, hB.data(), hB.size()).name("copy B");
    tf::cudaTask trsv = blas->c_trsv(
      CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N, CUBLAS_DIAG_UNIT, 
      N, dA, N, dB, 1
    ).name("trsv");
    tf::cudaTask d2h = blas->copy(r.data(), dB, r.size()).name("copy result");

    trsv.succeed(h2dA, h2dB)
        .precede(d2h);
  }).name("cublasFlow");

  executor.run(taskflow).wait();
  
  std::cout << "solution of the linear system: \n";
  for(size_t i=0; i<r.size(); ++i) {
    std::cout << "x" << i << ": " << r[i] << '\n';
  }

  return 0;
}

The program uses one cudaFlow task that spawns a capturer of (1) two copy tasks, h2dA and h2dB, to copy the triangular matrix A and the solution vector b, (2) one kernel task trsv to solve the triangular linear system storing the result in dB, and (3) one copy task d2h to copy the solution in dB to r.
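For reference, forward substitution gives the expected solution:

\[ x_1 = 5, \quad x_2 = 4 - x_1 = -1, \quad x_3 = 7 - x_1 - x_2 = 3 \]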

[Taskflow diagram of "Ax = b": inside the cudaFlow capturer "cublasFlow", copy A and copy B precede trsv, which precedes copy result.]

Use Level-3 Methods

We currently support the following level-3 methods:

Our level-3 methods capture the native cuBLAS level-3 calls and cuBLAS-extension calls with internal stream(s) optimized for maximum concurrency. The two scalars, alpha and beta, and the input and output matrices must sit in GPU memory space.

Example: Perform General Matrix-Matrix Multiplication

The following program performs general matrix multiplication on row-major layout using tf::cublasFlowCapturer::c_gemm:

#include <taskflow/cublasflow.hpp>

int main() {

  const int M = 2, N = 4, K = 3;

  const std::vector<float> hA = {      // M x K matrix
    11, 12, 13, 
    14, 15, 16
  };

  const std::vector<float> hB = {      // K x N matrix
    11, 12, 13, 14,
    15, 16, 17, 18,
    19, 20, 21, 22
  };

  const std::vector<float> golden = {  // M x N matrix
    548, 584, 620, 656,
    683, 728, 773, 818 
  };

  std::vector<float> hC(M*N);
    
  float *dA, *dB, *dC, *dAlpha, *dBeta;

  tf::Taskflow taskflow("Matrix Multiplication");
  tf::Executor executor;

  tf::Task malloc_dA = taskflow.emplace(
    [&](){ dA = tf::cuda_malloc_device<float>(hA.size()); }
  ).name("malloc_dA");      // allocate GPU memory for dA
  
  tf::Task malloc_dB = taskflow.emplace(
    [&](){ dB = tf::cuda_malloc_device<float>(hB.size()); }
  ).name("malloc_dB");      // allocate GPU memory for dB
  
  tf::Task malloc_dC = taskflow.emplace(
    [&](){ dC = tf::cuda_malloc_device<float>(hC.size()); }
  ).name("malloc_dC");      // allocate GPU memory for dC
  
  tf::Task malloc_dAlpha = taskflow.emplace(
    [&](){ dAlpha = tf::cuda_malloc_device<float>(1); }
  ).name("malloc_dAlpha");  // allocate GPU memory for scalar alpha
  
  tf::Task malloc_dBeta = taskflow.emplace(
    [&](){ dBeta = tf::cuda_malloc_device<float>(1); }
  ).name("malloc_dBeta");   // allocate GPU memory for scalar beta

  tf::Task cublasFlow = taskflow.emplace([&](tf::cudaFlowCapturer& capturer) {
    tf::cudaTask blas  = capturer.make_capturer<tf::cublasFlowCapturer>();
    tf::cudaTask alpha = blas->single_task([=] __device__ () { *dAlpha = 1; })
                              .name("alpha=1");
    tf::cudaTask beta  = blas->single_task([=] __device__ () { *dBeta  = 0; })
                              .name("beta=0");
    tf::cudaTask copyA = blas->copy(dA, hA.data(), hA.size()).name("copyA"); 
    tf::cudaTask copyB = blas->copy(dB, hB.data(), hB.size()).name("copyB");
    tf::cudaTask gemm  = blas->c_gemm(CUBLAS_OP_N, CUBLAS_OP_N,
      M, N, K, dAlpha, dA, K, dB, N, dBeta, dC, N
    ).name("C = alpha * A * B + beta * C");
    tf::cudaTask copyC = blas->copy(hC.data(), dC, hC.size()).name("copyC");

    gemm.succeed(alpha, beta, copyA, copyB)
        .precede(copyC);
  }).name("cublasFlow");

  cublasFlow.succeed(  // cublasFlow runs after GPU memory operations
    malloc_dA, malloc_dB, malloc_dC, malloc_dAlpha, malloc_dBeta
  );

  executor.run(taskflow).wait();
  
  std::cout << "Matrix C:\n";
  for(int m=0; m<M; m++) {
    for(int n=0; n<N; n++) {
      std::cout << hC[m*N+n] << ' ';
    }
    std::cout << '\n';
  }

  return 0;
}

The program uses five static tasks to allocate GPU memory for dA, dB, dC, dAlpha, and dBeta in parallel, so that the expensive GPU memory operations can overlap with one another as much as possible. The cublasFlow task spawns a cublasFlow capturer of (1) one single kernel task alpha to set dAlpha to 1, (2) one single kernel task beta to set dBeta to 0, (3) two copy tasks, copyA and copyB, to copy data from CPU to GPU, (4) one kernel task gemm to perform dC = dA * dB, and (5) one copy task copyC to copy the result from GPU to CPU.
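As a sanity check, the first entry of the golden matrix is the inner product of the first row of A and the first column of B:

\[ C_{0,0} = 11 \times 11 + 12 \times 15 + 13 \times 19 = 121 + 180 + 247 = 548 \]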

[Taskflow diagram of "Matrix Multiplication": the five malloc tasks precede the cudaFlow capturer "cublasFlow", inside which alpha=1, beta=0, copyA, and copyB precede the gemm task "C = alpha * A * B + beta * C", which precedes copyC.]

Include Other cuBLAS Methods

We do not wrap every cuBLAS function, but you can easily extend the capturer. tf::cublasFlowCapturer is derived from tf::cudaFlowCapturerBase and is created through the factory interface tf::cudaFlowCapturer::make_capturer. Each tf::cudaFlowCapturerBase object holds a pointer to its parent capturer, accessible through tf::cudaFlowCapturerBase::factory, on which you can call tf::cudaFlowCapturer::on together with tf::cublasFlowCapturer::native_handle to capture cuBLAS functions that are not yet provided by tf::cublasFlowCapturer. The following example captures a Hermitian rank-k update using cublasCherkx.

taskflow.emplace([&](tf::cudaFlowCapturer& capturer){
  // create a cublasFlow capturer
  auto blas = capturer.make_capturer<tf::cublasFlowCapturer>();

  // use the base method tf::cudaFlowCapturer::on to capture other functions
  blas->factory()->on([&](cudaStream_t stream){
    cublasSetStream(blas->native_handle(), stream);
    cublasCherkx(blas->native_handle(), your_args...);
  }).name("Hermitian rank-k update");

}).name("capturer");

By default, we associate the native cuBLAS handle with CUBLAS_POINTER_MODE_DEVICE. The scalars, alpha and/or beta, and the input/output matrices must be accessible in GPU memory space.

Know More About cublasFlow Capturer

The following resources provide more details about tf::cublasFlowCapturer: