Code Yarns ‍👨‍💻
Tech BlogPersonal Blog

CUDA: Memory Checker

📅 2011-Mar-07 ⬩ ✍️ Ashwin Nanjappa ⬩ 📚 Archive

 

CUDA essentially takes the programmer back to the old world of C where cowboys managed memory all by themselves. This means that memory bugs are sure to creep into even the most carefully written CUDA code!

The CUDA memory checker is one of the tools that can be used to detect and fix memory bugs. On Windows, the CUDA memory checker ships as a standalone program named cuda-memcheck.exe and can be found in the %CUDA_BIN_PATH% directory.

CUDA memory checker checks for 2 kinds of memory bugs: out-of-bounds and misaligned accesses in global memory. Here is a simple CUDA program with one such memory bug:

__global__ void fooKernel(const int* inArr, int num, int* outArr)
{
    const int threadNum    = gridDim.x * blockDim.x;
    const int curThreadIdx = (blockIdx.x * blockDim.x) + threadIdx.x;

    for (int i = curThreadIdx; i < num; i += threadNum)
        outArr[i] = inArr[i] + inArr[i + 1];

    return;
}

int main()
{
    // Allocate memory

    const int num   = 100;
    int* dInArr     = NULL;
    int* dOutArr    = NULL;
    const int size  = num * sizeof(*dInArr);

    cudaMalloc(&dInArr, size);
    cudaMalloc(&dOutArr, size);

    // Compute

    const int blocksPerGrid     = 128;
    const int threadsPerBlock   = 128;

    fooKernel<<<blocksPerGrid, threadsPerBlock>>>(dInArr, num, dOutArr);

    // Free memory

    cudaFree(dInArr);
    cudaFree(dOutArr);

    return 0;
}

fooKernel is doing one iteration of the prefix scan algorithm, adding each element of the input array with its next element and storing that result. The thread accessing the last element [99] will also access [100], which is out-of-bounds of the input array.

Such bugs execute silently producing puzzling results and can escape all kinds of CUDA error checking done by the astute programmer.

When this program is executed with the CUDA memory checker, the memory bug is detected:

$> cuda-memcheck Foo.exe
========= CUDA-MEMCHECK
========= Invalid __global__ read of size 4
=========     at 0x00000070 in fooKernel
=========     by thread (99,0,0) in block (0,0)
=========     Address 0x05100190 is out of bounds
=========
========= ERROR SUMMARY: 1 error

Not only is the programmer now aware of a bug in his program, he knows much more. The output of CUDA memory checker points out that there was an out-of-bounds read of a datatype of size 4 in fooKernel by thread number 99. Armed with this information, the memory bug can hopefully be found and squashed! 😊

Tried with: CUDA 3.2