If a function needs to be used by both host and device code, it is convenient to define it in a single location. Such a function can be defined by applying both of the qualifiers: __host__ __device__.
For example, here is a function that computes the maximum of two numbers and can be used in both host and device code:
__host__ __device__
int fooMax( int a, int b )
{
    return ( a > b ) ? a : b;
}
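Such a function can then be called, unchanged, both from ordinary host code and from inside a kernel. A minimal sketch of this, assuming a hypothetical kernel barKernel and host function bar:
__global__
void barKernel( const int* inArray, int num, int* outArray )
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if ( idx < num )
        outArray[ idx ] = fooMax( inArray[ idx ], 0 ); // Device-side call
}

void bar( int a, int b )
{
    int m = fooMax( a, b ); // Host-side call
}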
The __CUDA_ARCH__ macro can be used if a part of the function needs to be compiled selectively for either host or device:
#include <assert.h>

__host__ __device__
int fooMax( int a, int b )
{
// Compiled only on host, where __CUDA_ARCH__ is not defined
#ifndef __CUDA_ARCH__
    assert( ( a < 100 ) && ( b < 100 ) && "Numbers beyond the application bound!" );
#endif

    return ( a > b ) ? a : b;
}
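Note that when device code is being compiled, __CUDA_ARCH__ is not merely defined: its value encodes the compute capability being compiled for (for example, 200 for compute capability 2.0). A minimal sketch that branches on this value, assuming compute capability 2.0 is one of the build targets:
__host__ __device__
int fooMax( int a, int b )
{
#if defined( __CUDA_ARCH__ ) && ( __CUDA_ARCH__ >= 200 )
    // Device code path, used only for compute capability 2.0 and above
    return max( a, b );
#else
    // Host code path, or device code for older architectures
    return ( a > b ) ? a : b;
#endif
}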
Sometimes, the host code and device code that use the common function are spread across different compilation units. Here is such a scenario:
/////////////////////////////////////////////////
// fooDevice.cu

__global__
void fooKernel( const int* inArray, int num, int* outArray )
{
    // Do something
}
/////////////////////////////////////////////////

/////////////////////////////////////////////////
// fooDevice.h

__global__
void fooKernel( const int*, int, int* );
/////////////////////////////////////////////////

/////////////////////////////////////////////////
// fooHost.cu

#include "fooDevice.h"

void foo()
{
    fooKernel<<< x, y >>>( devInArray, num, devOutArray );
}
/////////////////////////////////////////////////
In the above example, the host code is in fooHost.cu, the device code is in fooDevice.cu and there is a fooDevice.h that is used to share kernel and function declarations between the two compilation units. In such a setup, how do we share the __host__ __device__ function between the two compilation units?
Placing the __host__ __device__ function in a header file is not a suitable solution, since it leads to multiple definitions of the function when the header is included in multiple compilation units.
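To illustrate, here is a sketch of that arrangement (fooCommon.h is a hypothetical shared header):
/////////////////////////////////////////////////
// fooCommon.h

__host__ __device__
int fooMax( int a, int b )
{
    return ( a > b ) ? a : b;
}
/////////////////////////////////////////////////
Since both fooDevice.cu and fooHost.cu would include fooCommon.h, each compilation unit ends up with its own definition of fooMax and the linker reports a multiple definition error.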
The only solution in such a setup is to place the __host__ __device__ function definition in the compilation unit that has the device kernels that use this function. That is, place the function in fooDevice.cu and share its host function declaration in fooDevice.h as follows:
/////////////////////////////////////////////////
// fooDevice.cu

__host__ __device__
int fooMax( int a, int b )
{
    return ( a > b ) ? a : b;
}

__global__
void fooKernel( const int* inArray, int num, int* outArray )
{
    // Do something using fooMax
    outArray[ idx ] = fooMax( inArray[ idx0 ], inArray[ idx1 ] );
}
/////////////////////////////////////////////////

/////////////////////////////////////////////////
// fooDevice.h

int fooMax( int, int );

__global__
void fooKernel( const int*, int, int* );
/////////////////////////////////////////////////

/////////////////////////////////////////////////
// fooHost.cu

#include "fooDevice.h"

void foo()
{
    int m = fooMax( x, y );
    fooKernel<<< x, y >>>( devInArray, num, devOutArray );
}
/////////////////////////////////////////////////
Doing it the other way around will not work! That is, placing the __host__ __device__ function in fooHost.cu and sharing its declaration with fooDevice.cu does not work, since the CUDA compiler needs the ability to inline a __device__ function, and inlining a function is only possible if the entire function definition is available.
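For illustration, this is the arrangement that fails. It mirrors the files above, with the roles of the two compilation units reversed (the idx variables and devInArray, num, devOutArray are placeholders, as before):
/////////////////////////////////////////////////
// fooHost.cu

#include "fooDevice.h"

__host__ __device__
int fooMax( int a, int b )
{
    return ( a > b ) ? a : b;
}

void foo()
{
    int m = fooMax( x, y );
    fooKernel<<< x, y >>>( devInArray, num, devOutArray );
}
/////////////////////////////////////////////////

/////////////////////////////////////////////////
// fooDevice.cu

#include "fooDevice.h"

__global__
void fooKernel( const int* inArray, int num, int* outArray )
{
    // fooMax is only declared in fooDevice.h; its definition lives in
    // fooHost.cu, so the compiler cannot inline it into this kernel.
    outArray[ idx ] = fooMax( inArray[ idx0 ], inArray[ idx1 ] );
}
/////////////////////////////////////////////////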
Tried with: CUDA 3.2