Taskflow  2.7.0
tf::cudaFlow Class Reference

methods for building a CUDA task dependency graph. More...

#include <cuda_flow.hpp>

Public Member Functions

bool empty () const
 queries the emptiness of the graph
 
cudaTask noop ()
 creates a no-operation task More...
 
template<typename F , typename... ArgsT>
cudaTask kernel (dim3 g, dim3 b, size_t s, F &&f, ArgsT &&... args)
 creates a kernel task More...
 
template<typename F , typename... ArgsT>
cudaTask kernel_on (int d, dim3 g, dim3 b, size_t s, F &&f, ArgsT &&... args)
 creates a kernel task on a device More...
 
cudaTask memset (void *dst, int v, size_t count)
 creates a memset task More...
 
cudaTask memcpy (void *tgt, const void *src, size_t bytes)
 creates a memcpy task More...
 
template<typename T >
std::enable_if_t< is_pod_v< T > &&(sizeof(T)==1||sizeof(T)==2||sizeof(T)==4), cudaTask > zero (T *dst, size_t count)
 creates a zero task that zeroes a typed memory block More...
 
template<typename T >
std::enable_if_t< is_pod_v< T > &&(sizeof(T)==1||sizeof(T)==2||sizeof(T)==4), cudaTask > fill (T *dst, T value, size_t count)
 creates a fill task that fills a typed memory block with a value More...
 
template<typename T , std::enable_if_t<!std::is_same< T, void >::value, void > * = nullptr>
cudaTask copy (T *tgt, const T *src, size_t num)
 creates a copy task More...
 
void device (int device)
 assigns a device to launch the cudaFlow More...
 
int device () const
 queries the device associated with the cudaFlow
 
template<typename P >
void join_until (P &&predicate)
 offloads the cudaFlow with the given stop predicate and then joins the execution More...
 
void join_n (size_t N)
 offloads the cudaFlow the given number of times and then joins the execution More...
 
void join ()
 offloads the cudaFlow once and then joins the execution
 
template<typename I , typename C >
cudaTask for_each (I first, I last, C &&callable)
 applies a callable to each dereferenced element of the data array More...
 
template<typename I , typename C >
cudaTask for_each_index (I first, I last, I step, C &&callable)
 applies a callable to each index in the range with the step size More...
 
template<typename T , typename C , typename... S>
cudaTask transform (T *tgt, size_t N, C &&callable, S *... srcs)
 applies a callable to a source range and stores the result in a target range More...
 
template<typename T >
cudaTask transpose (const T *d_in, T *d_out, size_t rows, size_t cols)
 

Friends

class Executor
 

Detailed Description

methods for building a CUDA task dependency graph.

A cudaFlow is a high-level interface to manipulate GPU tasks using the task dependency graph model. The class provides a set of methods for creating and launching different tasks on one or multiple CUDA devices, for instance, kernel tasks, data transfer tasks, and memory operation tasks.
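
The following sketch shows a typical Taskflow 2.7-style workflow, where a cudaFlow is built by emplacing a callable that takes a tf::cudaFlow& into a tf::Taskflow. The kernel name saxpy, the buffer sizes, and the include path are assumptions for illustration, not part of this reference:

#include <taskflow/taskflow.hpp>  // assumption: cudaFlow support is available when compiled with nvcc;
                                  // some versions expose it through a separate cudaflow header

__global__ void saxpy(int n, float a, const float* x, float* y) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if(i < n) y[i] = a * x[i] + y[i];
}

int main() {
  const int N = 1 << 20;
  float *dx{nullptr}, *dy{nullptr};
  cudaMalloc(&dx, N * sizeof(float));
  cudaMalloc(&dy, N * sizeof(float));

  tf::Executor executor;
  tf::Taskflow taskflow;

  taskflow.emplace([=](tf::cudaFlow& cf) {
    tf::cudaTask init_x = cf.zero(dx, N);   // zero-initialize x on the device
    tf::cudaTask init_y = cf.zero(dy, N);   // zero-initialize y on the device
    tf::cudaTask exec   = cf.kernel((N + 255) / 256, 256, 0, saxpy, N, 2.0f, dx, dy);
    init_x.precede(exec);                   // kernel runs after both zero tasks
    init_y.precede(exec);
  });

  executor.run(taskflow).wait();

  cudaFree(dx);
  cudaFree(dy);
}

Compilation with nvcc is assumed; the algorithm tasks shown later (for_each, for_each_index, transform) additionally require nvcc's extended device lambda support.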

Member Function Documentation

◆ copy()

template<typename T , std::enable_if_t<!std::is_same< T, void >::value, void > * >
cudaTask tf::cudaFlow::copy(T* tgt, const T* src, size_t num)

creates a copy task

Template Parameters
  T      element type (non-void)
Parameters
  tgt    pointer to the target memory block
  src    pointer to the source memory block
  num    number of elements to copy
Returns
cudaTask handle

A copy task transfers num*sizeof(T) bytes of data from a source location to a target location. The direction can be arbitrary among CPUs and GPUs.
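
A minimal sketch of a typed device-to-host transfer inside the cudaFlow callable; cf, N, and the device pointer d_result are placeholders assumed for illustration:

std::vector<float> h_result(N);
// copy N elements (num*sizeof(T) = N*sizeof(float) bytes) from device to host
tf::cudaTask t = cf.copy(h_result.data(), d_result, N);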

◆ device()

void tf::cudaFlow::device ( int  device)
inline

assigns a device to launch the cudaFlow

A cudaFlow can only be assigned to a device once.

Parameters
  device    target device identifier

◆ fill()

template<typename T >
std::enable_if_t< is_pod_v< T > &&(sizeof(T)==1||sizeof(T)==2||sizeof(T)==4), cudaTask > tf::cudaFlow::fill(T* dst, T value, size_t count)

creates a fill task that fills a typed memory block with a value

Template Parameters
  T        element type (size of T must be either 1, 2, or 4)
Parameters
  dst      pointer to the destination device memory area
  value    value to fill for each element of type T
  count    number of elements

A fill task fills the first count elements of type T with value in the device memory area pointed to by dst. The value to fill is interpreted in type T rather than in bytes.
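
For example, a hedged sketch that fills a device array of N int elements (a 4-byte type) with the value 7; cf, N, and d_ints are assumptions for illustration:

int* d_ints{nullptr};
cudaMalloc(&d_ints, N * sizeof(int));
// fill each of the N int elements with 7, interpreted as an int rather than a byte pattern
tf::cudaTask t = cf.fill(d_ints, 7, N);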

◆ for_each()

template<typename I , typename C >
cudaTask tf::cudaFlow::for_each(I first, I last, C&& callable)

applies a callable to each dereferenced element of the data array

Template Parameters
  I           iterator type
  C           callable type
Parameters
  first       iterator to the beginning of the range
  last        iterator to the end of the range
  callable    the callable to apply to each dereferenced element

This method is equivalent to the parallel execution of the following loop on a GPU:

for(auto itr = first; itr != last; itr++) {
callable(*itr);
}
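
A minimal sketch, assuming cf is the tf::cudaFlow being built, d_data points to N floats on the device, and the code is compiled with nvcc's extended device lambda support:

// set every element of the device range [d_data, d_data + N) to 1.0f
tf::cudaTask t = cf.for_each(d_data, d_data + N,
  [] __device__ (float& x) { x = 1.0f; }
);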

◆ for_each_index()

template<typename I , typename C >
cudaTask tf::cudaFlow::for_each_index(I first, I last, I step, C&& callable)

applies a callable to each index in the range with the step size

Template Parameters
  I           index type
  C           callable type
Parameters
  first       beginning index
  last        last index (exclusive)
  step        step size
  callable    the callable to apply to each index in the range

This method is equivalent to the parallel execution of the following loop on a GPU:

// positive step: [first, last)
for(auto i=first; i<last; i+=step) {
callable(i);
}
// negative step: [first, last)
for(auto i=first; i>last; i+=step) {
callable(i);
}
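
A minimal sketch with a positive step, assuming cf, N, and the device pointer d_data as in the for_each example and extended device lambdas enabled:

// write i into d_data[i] for i = 0, 1, ..., N-1
tf::cudaTask t = cf.for_each_index(0, (int)N, 1,
  [d_data] __device__ (int i) { d_data[i] = (float)i; }
);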

◆ join_n()

void tf::cudaFlow::join_n ( size_t  N)
inline

offloads the cudaFlow the given number of times and then joins the execution

Parameters
  N    number of executions

◆ join_until()

template<typename P >
void tf::cudaFlow::join_until ( P &&  predicate)

offloads the cudaFlow with the given stop predicate and then joins the execution

Template Parameters
  P            predicate type (a callable that returns a boolean)
Parameters
  predicate    a predicate callable (returns true to stop)

Immediately offloads the present cudaFlow onto a GPU and repeatedly executes it until the predicate returns true. When execution finishes, the cudaFlow is joined. A joined cudaFlow becomes invalid and cannot accept further operations.
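
A sketch of repeating the captured GPU graph a fixed number of times inside the cudaFlow callable. The predicate is written here as a no-argument callable that returns true when execution should stop, which matches common Taskflow examples; verify the exact predicate signature against your Taskflow version:

// offload the cudaFlow five times, then join
cf.join_until([repeat = 5] () mutable { return repeat-- == 0; });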

◆ kernel()

template<typename F , typename... ArgsT>
cudaTask tf::cudaFlow::kernel(dim3 g, dim3 b, size_t s, F&& f, ArgsT&&... args)

creates a kernel task

Template Parameters
  F        kernel function type
  ArgsT    kernel function parameter types
Parameters
  g        configured grid dimensions
  b        configured block dimensions
  s        configured shared memory size in bytes
  f        kernel function
  args     arguments to forward to the kernel function by copy
Returns
cudaTask handle
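
As a sketch, launching the saxpy kernel from the Detailed Description above over N elements, with 256 threads per block and no dynamic shared memory (cf, N, d_x, and d_y are assumed placeholders):

// grid, block, shared-memory size, kernel function, then kernel arguments (forwarded by copy)
tf::cudaTask t = cf.kernel((N + 255) / 256, 256, 0, saxpy, N, 2.0f, d_x, d_y);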

◆ kernel_on()

template<typename F , typename... ArgsT>
cudaTask tf::cudaFlow::kernel_on(int d, dim3 g, dim3 b, size_t s, F&& f, ArgsT&&... args)

creates a kernel task on a device

Template Parameters
  F        kernel function type
  ArgsT    kernel function parameter types
Parameters
  d        device identifier to launch the kernel on
  g        configured grid dimensions
  b        configured block dimensions
  s        configured shared memory size in bytes
  f        kernel function
  args     arguments to forward to the kernel function by copy
Returns
cudaTask handle
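
The same sketch as for kernel(), but pinned to device 1, an assumed second GPU on the system:

// launch the saxpy kernel on device 1 instead of the cudaFlow's device
tf::cudaTask t = cf.kernel_on(1, (N + 255) / 256, 256, 0, saxpy, N, 2.0f, d_x, d_y);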

◆ memcpy()

cudaTask tf::cudaFlow::memcpy(void* tgt, const void* src, size_t bytes)
inline

creates a memcpy task

Parameters
  tgt      pointer to the target memory block
  src      pointer to the source memory block
  bytes    bytes to copy
Returns
cudaTask handle

A memcpy task transfers the given number of bytes from a source location to a target location. The direction can be arbitrary among CPUs and GPUs.
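
A sketch of an untyped host-to-device transfer inside the cudaFlow callable, assuming h_x is a host std::vector<float> of N elements and d_x a matching device allocation:

// copy N*sizeof(float) bytes from host memory to device memory
tf::cudaTask t = cf.memcpy(d_x, h_x.data(), N * sizeof(float));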

◆ memset()

cudaTask tf::cudaFlow::memset(void* dst, int v, size_t count)
inline

creates a memset task

Parameters
  dst      pointer to the destination device memory area
  v        value to set for each byte of the specified memory
  count    size in bytes to set

A memset task fills the first count bytes of the device memory area pointed to by dst with the byte value v.
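
For example, a sketch that clears a device buffer of N floats to all-zero bytes; cf, N, and d_x are assumed placeholders:

// set every byte of the N*sizeof(float)-byte buffer to 0
tf::cudaTask t = cf.memset(d_x, 0, N * sizeof(float));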

◆ noop()

cudaTask tf::cudaFlow::noop ( )
inline

creates a no-operation task

An empty node performs no operation during execution, but it can be used for transitive ordering. For example, a phased execution graph with two groups of n nodes and a barrier between them can be represented with one empty node and 2*n dependency edges, rather than n^2 dependency edges without the empty node.
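
A sketch of this barrier pattern, assuming a1, a2 and b1, b2 are cudaTask handles previously created on cf:

tf::cudaTask barrier = cf.noop();  // empty node acting as a phase barrier
a1.precede(barrier);
a2.precede(barrier);
barrier.precede(b1);
barrier.precede(b2);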

◆ transform()

template<typename T , typename C , typename... S>
cudaTask tf::cudaFlow::transform(T* tgt, size_t N, C&& callable, S*... srcs)

applies a callable to a source range and stores the result in a target range

Template Parameters
  T           result type
  C           callable type
  S           source types
Parameters
  tgt         pointer to the starting address of the target range
  N           number of elements in the range
  callable    the callable to apply to each element in the range
  srcs        pointers to the starting addresses of source ranges

This method is equivalent to the parallel execution of the following loop on a GPU:

for(size_t i=0; i<N; i++) {
tgt[i] = callable(src1[i], src2[i], src3[i], ...);
}
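
A sketch with two source ranges, assuming cf, N, and device pointers d_a, d_b, and d_c to N floats, with extended device lambdas enabled:

// d_c[i] = d_a[i] + d_b[i] for each i in [0, N)
tf::cudaTask t = cf.transform(d_c, N,
  [] __device__ (float a, float b) { return a + b; },
  d_a, d_b
);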

◆ zero()

template<typename T >
std::enable_if_t< is_pod_v< T > &&(sizeof(T)==1||sizeof(T)==2||sizeof(T)==4), cudaTask > tf::cudaFlow::zero(T* dst, size_t count)

creates a zero task that zeroes a typed memory block

Template Parameters
  T        element type (size of T must be either 1, 2, or 4)
Parameters
  dst      pointer to the destination device memory area
  count    number of elements

A zero task zeroes the first count elements of type T in the device memory area pointed to by dst.
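
For example, zeroing N elements of type float (a 4-byte POD type); cf, N, and d_x are assumed placeholders:

// zero the first N float elements pointed to by d_x
tf::cudaTask t = cf.zero(d_x, N);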

