Threads in CUDA are the workers that operate on the data. In the CUDA architecture, threads are grouped into blocks, and blocks are grouped into a grid. Given such an architecture, there are two common techniques to allocate the data among the threads. Put another way, there are two ways for each thread to pick the data it should work on.
Technique 1: Chunks of Data Per Thread
In this technique, the data is broken into many contiguous chunks, and each thread works on one such chunk (or none). Here is sample code for illustration:
__global__ void fooKernel( const int* dataArray, int dataNum )
{
// Thread info
const int blocksPerGrid = gridDim.x;
const int threadsPerBlock = blockDim.x;
const int totalThreadNum = blocksPerGrid * threadsPerBlock;
const int curThreadIdx = ( blockIdx.x * threadsPerBlock ) + threadIdx.x;
// Work allocation
const int dataPerThread = ( dataNum + ( totalThreadNum - 1 ) ) / totalThreadNum;
const int curThreadDataBegin = dataPerThread * curThreadIdx;
const int curThreadDataEnd = curThreadDataBegin + dataPerThread;
// Iterate data chunk of this thread
for ( int idx = curThreadDataBegin; idx < curThreadDataEnd; ++idx )
{
// Check if data is out of bounds
if ( idx >= dataNum )
continue;
// Do something with data
int val = dataArray[ idx ];
}
return;
}
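For context, here is how such a kernel might be launched from the host. This is only a minimal sketch: the hostArray variable, the 64-block-by-256-thread launch configuration, and the omission of error checking are all assumptions made for illustration.
int* devArray = NULL;
// Allocate device memory and copy the input over (hostArray is assumed to hold dataNum ints)
cudaMalloc( (void**) &devArray, dataNum * sizeof( int ) );
cudaMemcpy( devArray, hostArray, dataNum * sizeof( int ), cudaMemcpyHostToDevice );
// Launch 64 blocks of 256 threads each, so totalThreadNum inside the kernel is 16384
fooKernel<<< 64, 256 >>>( devArray, dataNum );
cudaDeviceSynchronize();
cudaFree( devArray );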
Note that the work allocation calculation is:
const int dataPerThread = ( dataNum + ( totalThreadNum - 1 ) ) / totalThreadNum;
It is not a mere division of the available work (data) by the available labour (threads). Due to integer division, such a simple calculation would leave data unprocessed if dataNum is not a multiple of totalThreadNum. The expression above rounds the division up instead, so every data item is assigned to some thread.
The flip side is that, whenever dataNum is not a multiple of totalThreadNum, some threads end up with less work than others, and a few may get no work at all. This is an unavoidable fact of life for this technique! 😊
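To make this concrete with made-up numbers: if dataNum is 10 and totalThreadNum is 4, the calculation gives ( 10 + 3 ) / 4 = 3 items per thread, so the four threads cover index ranges 0-2, 3-5, 6-8 and 9-11, with the bounds check below trimming the last range. A plain 10 / 4 = 2 would have covered only indices 0-7 and silently dropped the last two items.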
This is also why we need to ensure we only ever access elements that lie inside the input data:
if ( idx >= dataNum )
continue;
This work allocation technique is advantageous if the work done by a thread benefits from access to other elements in the same chunk. If that is not the case, Technique 2 is far easier to write and understand.
Technique 2: Iterate With a Large Increment
Simply put, in each iteration a thread accesses the data item that is a long distance away from its current one. How far away? A distance equal to the total number of threads. This iteration is very simple to write:
__global__ void fooKernel( const int* dataArray, int dataNum )
{
// Thread info
const int blocksPerGrid = gridDim.x;
const int threadsPerBlock = blockDim.x;
const int totalThreadNum = blocksPerGrid * threadsPerBlock;
const int curThreadIdx = ( blockIdx.x * threadsPerBlock ) + threadIdx.x;
// Iterate over data
for ( int idx = curThreadIdx; idx < dataNum; idx += totalThreadNum )
{
// Do something with data
int val = dataArray[ idx ];
}
return;
}
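A pleasant side effect is that one fixed launch configuration now handles any data size: each thread simply strides over however much data exists. As a minimal host-side sketch (the 64-block-by-256-thread configuration and the devArray variable are assumptions for illustration):
// devArray is assumed to already hold dataNum ints on the device
fooKernel<<< 64, 256 >>>( devArray, dataNum ); // 16384 threads stride over all the data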
The work allocation calculation and the bounds check of Technique 1 are both unnecessary here. All the magic is in the details of the loop:
for ( int idx = curThreadIdx;  // Use thread index as the starting location
      idx < dataNum;           // Thread never goes outside the data
      idx += totalThreadNum )  // Jump a long way
You can easily see that these two techniques lie at the ends of a spectrum of possible work allocation techniques. For example, a hybrid technique would handle chunks of size n and then increment by n * totalThreadNum. Look at your application closely and use what works best for you! 😊
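As a parting illustration, here is a minimal sketch of one such hybrid. The kernel name fooKernelHybrid and the chunk size parameter n are inventions for this example; each thread processes a contiguous chunk of n items, then jumps past the chunks of all the other threads.
__global__ void fooKernelHybrid( const int* dataArray, int dataNum, int n )
{
// Thread info
const int totalThreadNum = gridDim.x * blockDim.x;
const int curThreadIdx = ( blockIdx.x * blockDim.x ) + threadIdx.x;
// Outer loop: jump from chunk to chunk, n * totalThreadNum items at a time
for ( int base = curThreadIdx * n; base < dataNum; base += n * totalThreadNum )
{
// Inner loop: iterate the chunk of this thread, staying within bounds
for ( int idx = base; ( idx < base + n ) && ( idx < dataNum ); ++idx )
{
// Do something with data
int val = dataArray[ idx ];
}
}
return;
}
Note that setting n to 1 recovers Technique 2, while setting n to the dataPerThread value of Technique 1 recovers Technique 1.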