Code Yarns ‍👨‍💻

CUDA: Work Allocation Techniques

📅 2011-Feb-16 ⬩ ✍️ Ashwin Nanjappa ⬩ 🏷️ cuda ⬩ 📚 Archive


Threads in CUDA are the workers that operate on data. In the CUDA architecture, threads are grouped into blocks and blocks are grouped into a grid. Given such an architecture, there are two common techniques to allocate the data among the threads. Put another way, there are two ways for each thread to pick the data it should work on.

Technique 1: Chunks of Data Per Thread

In this technique, data is broken into contiguous chunks and each thread works on one such chunk (or none). Here is sample code for illustration:

__global__ void fooKernel( const int* dataArray, int dataNum )
{
    // Thread info
    const int blocksPerGrid   = gridDim.x;
    const int threadsPerBlock = blockDim.x;
    const int totalThreadNum  = blocksPerGrid * threadsPerBlock;
    const int curThreadIdx    = ( blockIdx.x * threadsPerBlock ) + threadIdx.x;

    // Work allocation
    const int dataPerThread  = ( dataNum + ( totalThreadNum - 1 ) ) / totalThreadNum;
    const int curThreadDataBegin = dataPerThread * curThreadIdx;
    const int curThreadDataEnd   = curThreadDataBegin + dataPerThread;

    // Iterate data chunk of this thread
    for ( int idx = curThreadDataBegin; idx < curThreadDataEnd; ++idx )
    {
        // Check if data out of bounds
        if ( idx >= dataNum )
            continue;

        // Do something with data
        int val = dataArray[ idx ];
    }

    return;
}

Note that the work allocation calculation is:

const int dataPerThread  = ( dataNum + ( totalThreadNum - 1 ) ) / totalThreadNum;

It is not a mere division of available work (data) by available labour (threads). Due to integer division, a plain dataNum / totalThreadNum rounds down, so the last few data items would be left unprocessed whenever dataNum is not a multiple of totalThreadNum. Adding totalThreadNum - 1 before dividing rounds the result up instead.

So, when dataNum is not a multiple of totalThreadNum, some threads end up with a partially filled chunk or no work at all. This is an unavoidable fact of life for this technique! 😊

This is also why every access must be checked against the bounds of the input data:

if ( idx >= dataNum )
    continue;

This work allocation technique is advantageous if the work done by a thread benefits from having access to elements lying in the same chunk. If that is not the case, then Technique 2 is far easier to write and understand.

Technique 2: Iterate With a Large Increment

Simply put, in each iteration every thread strides forward through the data by a fixed distance. How far? A distance equal to the total number of threads. This iteration is very simple to write:

__global__ void fooKernel( const int* dataArray, int dataNum )
{
    // Thread info
    const int blocksPerGrid   = gridDim.x;
    const int threadsPerBlock = blockDim.x;
    const int totalThreadNum  = blocksPerGrid * threadsPerBlock;
    const int curThreadIdx    = ( blockIdx.x * threadsPerBlock ) + threadIdx.x;

    // Iterate over data
    for ( int idx = curThreadIdx; idx < dataNum; idx += totalThreadNum )
    {
        // Do something with data
        int val = dataArray[ idx ];
    }

    return;
}

The work allocation calculation and the bounds check of Technique 1 are both not needed. All the magic is in the details of the loop:

for (
    int idx = curThreadIdx; // Use thread index as beginning location
    idx < dataNum;          // Thread never goes outside the data
    idx += totalThreadNum   // Jump a long way
    )

You can easily see that these techniques lie at the ends of a spectrum of possible work allocation techniques. For example, a hybrid technique would have each thread handle a chunk of n contiguous elements and then jump ahead by n * totalThreadNum to its next chunk. Look at your application closely and use what works best for you! 😊


© 2023 Ashwin Nanjappa • All writing under CC BY-SA license