This is boilerplate code I tend to use in every CUDA kernel to calculate thread information:
__global__ void fooKernel()
{
    // Thread info
    const int blocksPerGrid   = gridDim.x;
    const int threadsPerBlock = blockDim.x;
    const int totalThreadNum  = blocksPerGrid * threadsPerBlock;
    const int curThreadIdx    = ( blockIdx.x * threadsPerBlock ) + threadIdx.x;

    // Rest of kernel
}
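As an illustration of how these values are commonly used (not part of the boilerplate itself), here is a minimal sketch of a grid-stride loop over a hypothetical array data of length n, where totalThreadNum is the stride and curThreadIdx is the starting element:

__global__ void fooKernel( float* data, int n )
{
    // Thread info
    const int threadsPerBlock = blockDim.x;
    const int totalThreadNum  = gridDim.x * threadsPerBlock;
    const int curThreadIdx    = ( blockIdx.x * threadsPerBlock ) + threadIdx.x;

    // Grid-stride loop: each thread handles every totalThreadNum-th element,
    // so the kernel is correct for any n regardless of launch configuration
    for ( int i = curThreadIdx; i < n; i += totalThreadNum )
    {
        data[ i ] *= 2.0f;
    }
}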
This is written for a kernel launched with a 1-dimensional grid and a 1-dimensional block. Adapting it for grids and blocks of other dimensionalities is straightforward; see the sketch below.
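For example, a possible adaptation for a 2-dimensional grid of 2-dimensional blocks flattens both the block index and the in-block thread index into linear values (fooKernel2D is just a placeholder name):

__global__ void fooKernel2D()
{
    // Thread info for a 2D grid of 2D blocks, flattened to linear indices
    const int threadsPerBlock = blockDim.x * blockDim.y;
    const int blocksPerGrid   = gridDim.x * gridDim.y;
    const int totalThreadNum  = blocksPerGrid * threadsPerBlock;

    // Linear block index, then linear thread index within the whole grid
    const int curBlockIdx  = ( blockIdx.y * gridDim.x ) + blockIdx.x;
    const int curThreadIdx = ( curBlockIdx * threadsPerBlock )
                           + ( threadIdx.y * blockDim.x ) + threadIdx.x;

    // Rest of kernel
}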