Code Yarns ‍👨‍💻

CUDA: Launch Bounds

📅 2011-Jun-27 ⬩ ✍️ Ashwin Nanjappa

The CUDA compiler decides how many registers a kernel uses based on its complexity. A kernel compiled this way is flexible enough to be launched with any number of threads or blocks. However, if the approximate number of threads and blocks is known at compile time, the kernel can be optimized for such launches. This is done by informing the compiler of the launch configuration, which gives it a better chance of tuning the number of registers the kernel uses. The register count matters because the registers used per thread limit how many blocks can be resident on a multiprocessor at the same time.

Such a launch bound is set for a kernel as follows:

#define MAX_THREADS_PER_BLOCK 256
#define MIN_BLOCKS_PER_MP     2

__global__ void
__launch_bounds__( MAX_THREADS_PER_BLOCK, MIN_BLOCKS_PER_MP )
fooKernel( int* inArr, int* outArr )
{
    // ... Computation of kernel
}
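
To check what the compiler actually did with a bound, the kernel's attributes can be queried at runtime. The following is a minimal sketch, not part of the original post: `plainKernel` and `boundedKernel` are hypothetical names with a made-up trivial body, and the program only prints whatever `cudaFuncGetAttributes` reports for each.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define MAX_THREADS_PER_BLOCK 256
#define MIN_BLOCKS_PER_MP     2

// Hypothetical kernel compiled without any launch bounds.
__global__ void plainKernel(const int* inArr, int* outArr, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) outArr[i] = inArr[i] * 2;
}

// Same body, compiled with launch bounds.
__global__ void
__launch_bounds__(MAX_THREADS_PER_BLOCK, MIN_BLOCKS_PER_MP)
boundedKernel(const int* inArr, int* outArr, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) outArr[i] = inArr[i] * 2;
}

int main()
{
    cudaFuncAttributes attr;

    // Query the register count and per-block thread limit of each kernel.
    cudaError_t err = cudaFuncGetAttributes(&attr, plainKernel);
    if (err != cudaSuccess) {
        printf("plainKernel query failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("plainKernel:   %d registers/thread, max %d threads/block\n",
           attr.numRegs, attr.maxThreadsPerBlock);

    err = cudaFuncGetAttributes(&attr, boundedKernel);
    if (err != cudaSuccess) {
        printf("boundedKernel query failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("boundedKernel: %d registers/thread, max %d threads/block\n",
           attr.numRegs, attr.maxThreadsPerBlock);

    return 0;
}
```

For a kernel this small the two register counts will probably be the same. A difference is more likely to show up in register-heavy kernels. Compiling with `nvcc --ptxas-options=-v` prints the per-kernel register usage at build time as an alternative.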

The MAX_THREADS_PER_BLOCK parameter is mandatory, while the MIN_BLOCKS_PER_MP parameter is optional. Also note that if the kernel is launched with more than MAX_THREADS_PER_BLOCK threads per block, that launch will fail, because the kernel was compiled on the assumption that no block would ever have more threads than that.

Tried with: CUDA 4.0
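
The launch failure described above is easy to miss, since a kernel launch does not return an error code directly. The sketch below is one way to observe it, assuming a hypothetical helper `tryLaunch` and a trivial kernel body that are not in the original post. It launches the same bounded kernel once within the bound and once beyond it, and prints the error string reported by the runtime in each case.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define MAX_THREADS_PER_BLOCK 256

// Kernel compiled for at most MAX_THREADS_PER_BLOCK threads per block.
__global__ void
__launch_bounds__(MAX_THREADS_PER_BLOCK)
fooKernel(int* outArr, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) outArr[i] = i;
}

// Hypothetical helper: launch with the given block size and report the result.
static void tryLaunch(int* dOut, int n, int threadsPerBlock)
{
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    fooKernel<<<blocks, threadsPerBlock>>>(dOut, n);

    // Launch configuration errors are reported by cudaGetLastError().
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) {
        // The launch was accepted; wait for the kernel and catch runtime errors.
        err = cudaDeviceSynchronize();
    }
    printf("%d threads per block: %s\n", threadsPerBlock, cudaGetErrorString(err));
}

int main()
{
    const int n = 1024;
    int* dOut = NULL;

    cudaError_t err = cudaMalloc(&dOut, n * sizeof(int));
    if (err != cudaSuccess) {
        printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    tryLaunch(dOut, n, MAX_THREADS_PER_BLOCK);      // within the bound
    tryLaunch(dOut, n, 2 * MAX_THREADS_PER_BLOCK);  // exceeds the bound

    cudaFree(dOut);
    return 0;
}
```

The exact error string printed for the second launch may depend on the CUDA version and the device, so it is safer to compare the result against `cudaSuccess` than to match on a specific error code.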