Code Yarns ‍👨‍💻
Tech BlogPersonal Blog

How to set cache configuration in CUDA

📅 2011-Jun-27 ⬩ ✍️ Ashwin Nanjappa ⬩ 📚 Archive

On CUDA devices of compute capability 1.x, the amount of shared memory and L1 cache for each multiprocessor was fixed.

In devices of compute capability 2.0 and later, there is 64 KB of memory for each multiprocessor. This per-multiprocessor on-chip memory is split and used for both shared memory and L1 cache. By default, 48 KB is used as shared memory and 16 KB as L1 cache.

As CUDA kernels get more complex, they start to behave like CPU programs. There is lesser need to share data between kernels and more pressure for L1 caching. So, depending on the kind of kernels you are writing or using, you may want to change how the on-chip memory is allocated for them.

Set globally

The cudaDeviceSetCacheConfig function can be used to set preference for shared memory or L1 cache globally for all CUDA kernels in your code and even those used by Thrust. The option cudaFuncCachePreferShared prefers shared memory, that is, it sets 48 KB for shared memory and 16 KB for L1 cache. cudaFuncCachePreferL1 prefers L1, that is, it sets 16 KB for shared memory and 48 KB for L1 cache. cudaFuncCachePreferNone uses the preference set for the device or thread.

For example, to prefer L1 cache for all your kernels:

cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);

Set per kernel

The cudaFuncSetCacheConfig runtime function can be used to set a specific kernel to prefer the usage of the per-multiprocessor memory for either shared memory or L1 cache.

For example, to prefer L1 cache for a kernel:

// Kernel
__global__ void fooKernel(int* inArr, int* outArr)
{
    // ... Computation of kernel
    return;
}

int main()
{
    cudaFuncSetCacheConfig(fooKernel, cudaFuncCachePreferL1);

    // Call kernel any time after cache preference is set
    fooKernel<<<T, B>>>(inArr, outArr);

    return 0;
}