Code Yarns ‍👨‍💻
Tech BlogPersonal Blog

CUDA: Device Function in Header File

📅 2011-Mar-15 ⬩ ✍️ Ashwin Nanjappa ⬩ 🏷️ cuda, ptx, templates ⬩ 📚 Archive

Template kernels are very useful to write generic kernels that can handle multiple data types. (For more on template kernels go here.) However, they come with one drawback: any device function that is called by a template kernel needs to be defined in a header file. It cannot be hidden away in a different CUDA compilation unit. This is because the CUDA compiler needs to have the flexibility to inline the device function (if necessary) and to do that the device function definition needs to be accessible in the same compilation unit as the template kernel that calls it.

Here is an example. The template kernel below calls a device function and the definition of the device function needs to be in the header file for successful compilation:

__device__ int getCurThreadIdx()
    return ( ( blockIdx.x * blockDim.x ) + threadIdx.x );

template< typename T >
__global__ void fooKernel( const T* inArr, int num, T* outArr )
    const int threadNum = ( gridDim.x * blockDim.x );

    for ( int idx = getCurThreadIdx(); idx < num; idx += threadNum )
        outArr[ idx ] = inArr[ idx ];


One wishes that this were enough, but there seems to be an additional problem with CUDA. If the header file containing these functions is merely included into multiple CUDA compilation units, one faces a multiple definition error on the device function: : error LNK2005: "int __cdecl getCurThreadIdx(void)" (?getCurThreadIdx@@YAHXZ) already defined in

The only solution for this seems to be that the device function being called by a template kernel also needs to be explicitly inlined! The explicit inlining can be specified using the __forceinline__ qualifier on the device function:

__forceinline__ __device__ int getCurThreadIdx()
    return ( ( blockIdx.x * blockDim.x ) + threadIdx.x );

I further investigated by checking the .ptx files when the device function is forceinline and when it is not. As I guessed, the template kernel is inlining the device function anyway in both these cases. This is what I expected since the device function is very simple and should be inlined for optimization.


However, when the device function is not forceinline, the CUDA compiler seems to be generating the device function definition anyway, even though it is not being called by the template function! 😊

The problem with this behaviour by CUDA is that it means that only explicitly inlined device functions can be called from template kernels. This places severe restrictions on the functionality that can be achieved in a template kernel. Device functions that are long or computationally intensive cannot work with the template kernel due to function length restrictions. This beats the whole point of the Fermi architecture, which was to allow true function calls, not mere inlining! 😐

Note: More discussion on this topic is at StackOverflow here and NVIDIA Forums here.

Tried with: CUDA 3.2