The CUDA Occupancy Calculator is an Excel spreadsheet that ships with the CUDA Toolkit. It helps you check whether the number of threads per block you use to launch a kernel gives the best possible occupancy. The spreadsheet can be found at:
%ProgramData%\NVIDIA Corporation\GPU SDK\C\tools
The spreadsheet requires 4 inputs from you specific to the kernel you are analyzing:
1. The compute capability of the CUDA device
2. Threads per block you are using for the kernel
3. Registers per thread for the kernel
4. Shared memory per block
You already know (1) and (2): the compute capability is a property of your GPU, and the threads per block is the value you chose when launching the kernel. (3) and (4) can be found by compiling the code with the nvcc option --ptxas-options=-v. The register and shared memory usage of each kernel is then printed in the Output window of Visual Studio during compilation. Alternatively, run the CUDA program under the Compute Visual Profiler and read these values from the Profiler Output sheet.
Once these 4 numbers are entered, the 3 charts on the spreadsheet update to show where your kernel sits on each of them. The 3 charts plot occupancy against threads per block, registers per thread and shared memory per block respectively. Look for the red triangle on the chart whose parameter you have the flexibility to change.
For example, say that for a given kernel I have no say in the number of registers and the amount of shared memory it uses, but I can change the number of threads per block it launches with. Assume that I am currently using 200 threads per block for this kernel. In this case, I look at Chart 1 (Varying Block Size) and check whether the red triangle sits on one of the highest peaks of the curve. If it does not, I pick a threads per block value that does sit on such a peak (say 256) and try my kernel with that. In many cases, my CUDA program executes a bit faster after this change because more threads of the kernel are resident on each multiprocessor. Higher occupancy does not guarantee a speedup though, so it is worth timing the kernel before and after.
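To see why 256 can beat 200 in the example above, here is a rough sketch of the arithmetic behind Chart 1, written as plain host-side C++. It is not the spreadsheet's actual formula. It assumes a compute capability 2.0 device (32 threads per warp, at most 48 resident warps and 8 resident blocks per multiprocessor) and it ignores the register and shared memory limits entirely, so it only applies when those are not the bottleneck. The names SmLimits, kFermi20, residentWarps and occupancyPercent are made up for this illustration.

```cpp
#include <algorithm>
#include <cassert>

// Per-multiprocessor limits. The values below are the ones assumed here for
// compute capability 2.0; other compute capabilities have different limits.
struct SmLimits {
    int warpSize;        // threads per warp
    int maxWarpsPerSm;   // maximum resident warps per multiprocessor
    int maxBlocksPerSm;  // maximum resident blocks per multiprocessor
};

const SmLimits kFermi20 = {32, 48, 8};

// Resident warps per multiprocessor for a given block size, considering only
// the warp and block limits (registers and shared memory are ignored).
int residentWarps(int threadsPerBlock, const SmLimits& sm) {
    assert(threadsPerBlock > 0);
    // A partially filled warp still occupies a whole warp slot.
    int warpsPerBlock = (threadsPerBlock + sm.warpSize - 1) / sm.warpSize;
    // Blocks that fit under the warp limit, then capped by the block limit.
    int blocksByWarps = sm.maxWarpsPerSm / warpsPerBlock;
    int blocks = std::min(blocksByWarps, sm.maxBlocksPerSm);
    return blocks * warpsPerBlock;
}

// Occupancy as a percentage of the maximum number of resident warps.
double occupancyPercent(int threadsPerBlock, const SmLimits& sm) {
    return 100.0 * residentWarps(threadsPerBlock, sm) / sm.maxWarpsPerSm;
}
```

Under these assumptions, 200 threads per block rounds up to 7 warps per block, only 6 such blocks fit, and occupancyPercent(200, kFermi20) comes to 87.5. With 256 threads per block, each block is exactly 8 warps, 6 blocks fill all 48 warp slots, and occupancyPercent(256, kFermi20) comes to 100. If the kernel uses many registers or a lot of shared memory, the real figure can be lower than this sketch suggests, which is why the spreadsheet asks for those inputs too.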
Tried with: CUDA 4.0