CUDA: Memory Checker

CUDA essentially takes the programmer back to the old world of C where cowboys managed memory all by themselves. This means that memory bugs are sure to creep into even the most carefully written CUDA code!

The CUDA memory checker is one of the tools that can be used to detect and fix memory bugs. On Windows, the CUDA memory checker ships as a standalone program named cuda-memcheck.exe and can be found in the %CUDA_BIN_PATH% directory.

The CUDA memory checker checks for two kinds of memory bugs: out-of-bounds and misaligned accesses in global memory. Here is a simple CUDA program with one such memory bug:

__global__ void fooKernel(const int* inArr, int num, int* outArr)
{
    const int threadNum    = gridDim.x * blockDim.x;
    const int curThreadIdx = (blockIdx.x * blockDim.x) + threadIdx.x;

    for (int i = curThreadIdx; i < num; i += threadNum)
        outArr[i] = inArr[i] + inArr[i + 1];
}

int main()
{
    // Allocate memory

    const int num   = 100;
    int* dInArr     = NULL;
    int* dOutArr    = NULL;
    const int size  = num * sizeof(*dInArr);

    cudaMalloc(&dInArr, size);
    cudaMalloc(&dOutArr, size);

    // Compute

    const int blocksPerGrid     = 128;
    const int threadsPerBlock   = 128;

    fooKernel<<<blocksPerGrid, threadsPerBlock>>>(dInArr, num, dOutArr);

    // Free memory

    cudaFree(dInArr);
    cudaFree(dOutArr);

    return 0;
}

fooKernel performs one iteration of a prefix-scan-like computation, adding each element of the input array to its next element and storing the result. The thread handling the last element [99] also reads element [100], which is out of bounds of the input array.
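One way to squash this particular bug is to stop the loop one element short, so that the read of inArr[i + 1] can never go past the end of the array. A minimal sketch of the fixed kernel:

```cuda
__global__ void fooKernel(const int* inArr, int num, int* outArr)
{
    const int threadNum    = gridDim.x * blockDim.x;
    const int curThreadIdx = (blockIdx.x * blockDim.x) + threadIdx.x;

    // Loop only up to num - 1, so inArr[i + 1] never reads past the end
    for (int i = curThreadIdx; i < num - 1; i += threadNum)
        outArr[i] = inArr[i] + inArr[i + 1];
}
```

Note that outArr[num - 1] is now left unwritten; whether the last element needs special handling depends on what the surrounding algorithm expects.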

Such bugs execute silently, produce puzzling results, and can escape every kind of CUDA error checking done by even the most astute programmer.
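For instance, the usual style of error checking, inspecting the result of every runtime call and synchronizing after the kernel launch, reports success for this program. The checkCuda helper below is made up for illustration, not part of the CUDA API:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// checkCuda is a hypothetical helper, not part of the CUDA API
static void checkCuda(cudaError_t err, const char* msg)
{
    if (cudaSuccess != err)
    {
        fprintf(stderr, "%s: %s\n", msg, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

// Typical usage around the kernel launch:
//   fooKernel<<<blocksPerGrid, threadsPerBlock>>>(dInArr, num, dOutArr);
//   checkCuda(cudaGetLastError(), "fooKernel launch");
//   checkCuda(cudaThreadSynchronize(), "fooKernel execution");
```

(cudaThreadSynchronize was the synchronization call in the CUDA 3.2 era; it was later renamed cudaDeviceSynchronize.) Even with both checks in place, the out-of-bounds read in fooKernel typically sails through without raising any error.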

When this program is executed with the CUDA memory checker, the memory bug is detected:

$> cuda-memcheck Foo.exe
========= Invalid __global__ read of size 4
=========     at 0x00000070 in fooKernel
=========     by thread (99,0,0) in block (0,0)
=========     Address 0x05100190 is out of bounds
========= ERROR SUMMARY: 1 error

Not only is the programmer now aware of a bug in the program, they also know much more. The output of the CUDA memory checker points out that there was an out-of-bounds read of a 4-byte datatype in fooKernel by thread number 99. Armed with this information, the memory bug can hopefully be found and squashed! šŸ™‚

Tried with: CUDA 3.2

10 thoughts on “CUDA: Memory Checker”

  1. Hi,
    Nice post. I have a query. When you call a CUDA kernel with a specified number of threads and blocks, how many concurrent operations are done? In your kernel, you are incrementing your variable ‘i’ by ‘threadNum’ for each iteration, does this mean that each block is executed at a time?

    Thanks in advance.


    1. Bharath: The actual number depends on the launch parameters and the graphics card. But it is true that a maximum of threadNum threads can execute concurrently.


    2. Hi,
      If I have a code something like this,

      for ( int ; ; )
      for (int ; ; )

      If the data in one loop depends on the data from the previous loop, how do I implement such a situation in CUDA?

      How do loops actually work inside a CUDA kernel? And can a kernel with a different number of threads be called from inside another kernel?




    3. Bharath: For loops work inside a CUDA kernel just like in C or C++. Kernels cannot be called from each other. You can however write a device function that can be called from inside a kernel.
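A quick sketch of that, reusing the fooKernel from the post (the addNext helper name is made up for illustration):

```cuda
// A __device__ function can be called from inside a kernel,
// but cannot be launched from the host like a __global__ kernel
__device__ int addNext(const int* arr, int i)
{
    return arr[i] + arr[i + 1];
}

__global__ void fooKernel(const int* inArr, int num, int* outArr)
{
    const int threadNum    = gridDim.x * blockDim.x;
    const int curThreadIdx = (blockIdx.x * blockDim.x) + threadIdx.x;

    // Each thread calls the device function on its own elements
    for (int i = curThreadIdx; i < num - 1; i += threadNum)
        outArr[i] = addNext(inArr, i);
}
```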


  2. Hi,
    I have a query regarding the usage of shared memory on GPU.
    I am working with a block size of 1024, assuming 256 threads per block I would be using 8 blocks.
    Now, there is 16KB of shared memory associated with each of these 8 blocks. How do I access each of these shared memories? On the whole I will be getting 256KB of shared memory with no synchronization among them, right?

    I call a kernel say, bitreverse <<<8, 256>>> ( );
    So, it assigns 8 blocks with 256 threads each. Now how can I access each of 16KB shared memory associated with each of these 8 individual blocks in the kernel function?




    1. Bharath: No idea, since I tend to use the shared memory as an L1 cache rather than explicitly as shared memory. Try asking your query on StackOverflow, I am pretty sure the clever folks there have the answer šŸ™‚
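For what it's worth, each block only ever sees its own shared memory: a __shared__ array declared in a kernel is instantiated once per block, and __syncthreads() synchronizes only the threads of that block. A hedged sketch of the idea (kernel body is made up for illustration):

```cuda
__global__ void bitreverse(const int* in, int* out)
{
    // One instance of this array exists per block;
    // block 3 cannot see the copy belonging to block 5
    __shared__ int tile[256];

    const int i = (blockIdx.x * blockDim.x) + threadIdx.x;

    tile[threadIdx.x] = in[i];
    __syncthreads();  // synchronizes the threads of this block only

    // Each thread reads an element loaded by another thread of its own block
    out[i] = tile[blockDim.x - 1 - threadIdx.x];
}
```

There is no way for one block to index into another block's shared memory; data that must cross blocks has to go through global memory.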


  3. Hi,
    I have a kernel launch like this… assign <<<1, 512>>> ( ) ;

    I have to maximize the usage of shared memory, so I am using only 1 block so that entire 16KB of shared memory is available for that block.

    And now, I have to run multiple instances of the above kernel simultaneously. I can’t have multiple blocks because in that case the shared memory per block would be less.
    So, can I run the above kernel for multiple grids, each grid containing one block of 512 threads? In that case, are all the grids executed simultaneously?




  4. Hi,
    When I launch a kernel with n blocks, (assuming there is no dependence among the blocks), are all these blocks executed in parallel?

    Say, a particular part of my code uses 12Kb of shared memory. For GPUs of compute capability 2.0 and above, there is 48Kb of shared memory available. So, can I launch 4 simultaneous such instances to utilize the shared memory completely. It’s like launching a kernel with 4 blocks. Are these blocks executed in parallel?




    1. Bharath: Whether all your blocks are executed in parallel depends on various factors, including the number of SMs in your GPU. Since only one block actually executes at a time on an SM, your other doubts are not relevant.

