Code Yarns ‍👨‍💻

CUDA: Common Function for both Host and Device Code

📅 2011-Mar-14 ⬩ ✍️ Ashwin Nanjappa

If a function needs to be used by both host and device code, it is convenient to define it in a single location. Such a function can be defined by applying both the __host__ and __device__ qualifiers to it.

For example, here is a function that computes the maximum of two numbers and can be called from both host and device code:

__host__ __device__ int fooMax( int a, int b )
{
    return ( a > b ) ? a : b;
}
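
As a rough sketch of how such a function might be called from both sides, the snippet below uses fooMax in a host loop and in a kernel. The names FOO_HOST_DEVICE, fooMaxOnHost and fooMaxKernel are hypothetical and not part of the original example. The FOO_HOST_DEVICE macro is an added convenience, assuming you also want the file to build with a plain C++ compiler: it relies on nvcc defining __CUDACC__.

```cpp
// Sketch only: FOO_HOST_DEVICE, fooMaxOnHost and fooMaxKernel are
// hypothetical names introduced for illustration.
#include <cassert>
#include <cstddef>

// nvcc defines __CUDACC__. A plain C++ compiler does not know the CUDA
// qualifiers, so the macro expands to nothing in that case.
#ifdef __CUDACC__
#define FOO_HOST_DEVICE __host__ __device__
#else
#define FOO_HOST_DEVICE
#endif

// Single definition shared by host and device callers.
FOO_HOST_DEVICE int fooMax( int a, int b )
{
    return ( a > b ) ? a : b;
}

// Host caller: reduces an array to its maximum using fooMax.
int fooMaxOnHost( const int* arr, std::size_t num )
{
    assert( num >= 1 && "Array must not be empty" );

    int m = arr[ 0 ];
    for ( std::size_t i = 1; i < num; ++i )
        m = fooMax( m, arr[ i ] );
    return m;
}

#ifdef __CUDACC__
// Device caller: element-wise maximum of two arrays using the same fooMax.
__global__ void fooMaxKernel( const int* a, const int* b, int num, int* out )
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if ( idx < num )
        out[ idx ] = fooMax( a[ idx ], b[ idx ] );
}
#endif
```

When compiled with nvcc, both fooMaxOnHost and fooMaxKernel call the same fooMax definition. With a plain C++ compiler only the host path is built.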

The __CUDA_ARCH__ macro can be used if a part of the function needs to be compiled selectively for either host or device. The macro is defined only while the code is being compiled for the device:

#include <cassert>

__host__ __device__ int fooMax( int a, int b )
{
    // Compiled only on host, where __CUDA_ARCH__ is not defined
#ifndef __CUDA_ARCH__
    assert( ( a < 100 ) && ( b < 100 ) && "Numbers beyond the application bound!" );
#endif

    return ( a > b ) ? a : b;
}
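
To see which branch is taken, here is a minimal sketch that reports whether a function body was compiled for the host or for the device. The names FOO_HOST_DEVICE, fooIsDeviceCode and fooMaxChecked are hypothetical. The macro is an assumption added so that the snippet also builds with a plain C++ compiler.

```cpp
// Sketch only: FOO_HOST_DEVICE, fooIsDeviceCode and fooMaxChecked are
// hypothetical names introduced for illustration.
#include <cassert>

// Expand to the CUDA qualifiers under nvcc, and to nothing otherwise.
#ifdef __CUDACC__
#define FOO_HOST_DEVICE __host__ __device__
#else
#define FOO_HOST_DEVICE
#endif

// Reports which compilation pass produced the body being executed.
FOO_HOST_DEVICE bool fooIsDeviceCode()
{
#ifdef __CUDA_ARCH__
    return true;    // Device pass: __CUDA_ARCH__ is defined
#else
    return false;   // Host pass: __CUDA_ARCH__ is not defined
#endif
}

// Same idea as the assert example: the bound check exists only in host code.
FOO_HOST_DEVICE int fooMaxChecked( int a, int b )
{
#ifndef __CUDA_ARCH__
    assert( ( a < 100 ) && ( b < 100 ) && "Numbers beyond the application bound!" );
#endif

    return ( a > b ) ? a : b;
}
```

Called from host code, fooIsDeviceCode returns false and fooMaxChecked performs the bound check. Called from a kernel, the device version of each body is used instead, so the check is absent there.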

Sometimes, the host and device code that use the common function can be spread across different compilation units. Here is such a scenario:

/////////////////////////////////////////////////
// fooDevice.cu
__global__ void fooKernel( const int* inArray, int num, int* outArray )
{
    // Do something
}
/////////////////////////////////////////////////

/////////////////////////////////////////////////
// fooDevice.h
__global__ void fooKernel( const int*, int, int* );
/////////////////////////////////////////////////

/////////////////////////////////////////////////
// fooHost.cu
#include "fooDevice.h"
void foo()
{
    fooKernel<<< x, y >>>( devInArray, num, devOutArray );
}
/////////////////////////////////////////////////

In the above example, the host code is in fooHost.cu, the device code is in fooDevice.cu and there is a fooDevice.h that is used to share kernel and function declarations between the two compilation units. In such a setup, how do we share the __host__ __device__ function between the 2 compilation units?

Placing an ordinary (non-inline) definition of the __host__ __device__ function in a header file is not a suitable solution, since it leads to multiple definitions of the function at link time when the header is included in multiple compilation units.

A solution that works in such a setup is to place the __host__ __device__ function definition in the compilation unit that contains the device kernels that use it. That is, define the function in fooDevice.cu and share its host function declaration through fooDevice.h as follows:

/////////////////////////////////////////////////
// fooDevice.cu

__host__ __device__ int fooMax( int a, int b )
{
    return ( a > b ) ? a : b;
}

__global__ void fooKernel( const int* inArray, int num, int* outArray )
{
    // Do something using fooMax
    outArray[ idx ] = fooMax( inArray[ idx0 ], inArray[ idx1 ] );
}
/////////////////////////////////////////////////

/////////////////////////////////////////////////
// fooDevice.h
int fooMax( int, int );
__global__ void fooKernel( const int*, int, int* );
/////////////////////////////////////////////////

/////////////////////////////////////////////////
// fooHost.cu
#include "fooDevice.h"
void foo()
{
    int m = fooMax( x, y );
    fooKernel<<< x, y >>>( devInArray, num, devOutArray );
}
/////////////////////////////////////////////////

Doing it the other way around will not work! That is, placing the __host__ __device__ function definition in fooHost.cu and sharing only its declaration with fooDevice.cu does not work, because the CUDA compiler needs to be able to inline a __device__ function into the kernels that call it. Inlining is possible only if the entire function definition is available in the same compilation unit.

Tried with: CUDA 3.2