📅 2011-Mar-14 ⬩ ✍️ Ashwin Nanjappa ⬩ 🏷️ cuda, template ⬩ 📚 Archive
CUDA kernel functions can be made generic by writing them as template kernel functions. Doing this is very similar to writing template functions in C++.
Consider this kernel that merely writes the input integer array to the output integer array:
//////////////////////////////////////////
// FooDevice.cu

__global__ void fooKernel( const int* inArr, int* outArr, int num )
{
    const int curThreadIndex = ( blockIdx.x * blockDim.x ) + threadIdx.x;
    const int threadNum      = gridDim.x * blockDim.x;

    for ( int index = curThreadIndex; index < num; index += threadNum )
        outArr[ index ] = inArr[ index ];

    return;
}
//////////////////////////////////////////

//////////////////////////////////////////
// FooDevice.h

__global__ void fooKernel( const int*, int*, int );
//////////////////////////////////////////
To enable this kernel to handle arrays of any type, it can be rewritten as a template kernel. Just like template functions in C++, CUDA template kernels need to be defined in a header file. This is because the compiler generates code for a function specialized for a given type; to be able to do this at compile time inside a compilation unit, the complete function definition needs to be visible.
Rewriting the above kernel as a template kernel in the header file:
//////////////////////////////////////////
// FooDevice.cu
// Nothing here
//////////////////////////////////////////

//////////////////////////////////////////
// FooDevice.h

template< typename T >
__global__ void fooKernel( const T* inArr, T* outArr, int num )
{
    const int curThreadIndex = ( blockIdx.x * blockDim.x ) + threadIdx.x;
    const int threadNum      = gridDim.x * blockDim.x;

    for ( int index = curThreadIndex; index < num; index += threadNum )
        outArr[ index ] = inArr[ index ];

    return;
}
//////////////////////////////////////////
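For completeness, here is a sketch of host code that launches such a template kernel. This file, its name, and the grid and block sizes are illustrative assumptions, not part of the original code; the point is only that each instantiation is generated from the definition in the header:

```cuda
//////////////////////////////////////////
// Main.cu (illustrative host code; names and launch sizes are assumptions)

#include <cuda_runtime.h>
#include "FooDevice.h"

int main()
{
    const int num = 1024;

    // Device arrays of float
    float* dInArr  = NULL;
    float* dOutArr = NULL;
    cudaMalloc( (void**) &dInArr,  num * sizeof( float ) );
    cudaMalloc( (void**) &dOutArr, num * sizeof( float ) );

    // The compiler instantiates fooKernel< float > here, which is
    // possible only because its full definition is visible in FooDevice.h
    fooKernel< float ><<< 32, 128 >>>( dInArr, dOutArr, num );

    // A second instantiation, fooKernel< int >, can coexist in the same program
    int* dIntInArr  = NULL;
    int* dIntOutArr = NULL;
    cudaMalloc( (void**) &dIntInArr,  num * sizeof( int ) );
    cudaMalloc( (void**) &dIntOutArr, num * sizeof( int ) );
    fooKernel< int ><<< 32, 128 >>>( dIntInArr, dIntOutArr, num );

    cudaFree( dInArr );
    cudaFree( dOutArr );
    cudaFree( dIntInArr );
    cudaFree( dIntOutArr );

    return 0;
}
//////////////////////////////////////////
```

The explicit `< float >` and `< int >` arguments are optional when the types can be deduced from the pointer arguments, just as with ordinary C++ function templates.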
Tried with: CUDA 3.2