David Kirk on CUDA and Fermi

There is a good talk by David Kirk about Fermi up on YouTube. I found it quite useful: Kirk explains various CUDA and Fermi concepts with better abstractions than the ones I had in my mind.

A few notes from the talk:

1. The CPU die is mostly cache, with a small amount of execution and control logic; the GPU die is the other way around. This is because the CPU is built for the cache hit, while the GPU is built for the cache miss.

2. L1/shared memory latency: ~10 cycles
Global memory latency: ~200-400 cycles
This gap is why reused data is worth staging in shared memory (see the sketch after this list).

3. Even though a Fermi SM has 32K registers in its register file, a single thread may use at most 63 of them. This constraint comes from the number of bits NVIDIA uses to address a thread's registers within the register file. (A sketch of inspecting and capping register usage follows the list.)

4. Core clock speeds remain around 1 GHz, which is significantly lower than CPU clock speeds. This is intentional: power consumption grows super-linearly with clock speed, and GPUs already consume a lot of power, so it would not be good to see that grow further.

5. The minimum size of a global memory read is 128 bytes. This minimum is dictated by DRAM technology and its manufacturers, and will only get larger as DRAM technology progresses. (A coalescing sketch follows the list.)

6. The reason the CUDA profiler needs to run an application multiple times is that each SM has only a limited number of hardware counters. (Kirk says 4, but the CUDA programming guide says 8.)
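
On note 2: the gap between those two latencies is the usual motivation for staging reused data in shared memory. Below is a minimal sketch of that pattern, a 1D averaging filter where each block loads its input (plus a halo) from global memory once and then reads it repeatedly from shared memory. The kernel and its names (blurKernel, RADIUS, BLOCK_SIZE) are mine for illustration, not from the talk, and it assumes the kernel is launched with BLOCK_SIZE threads per block.

    #define RADIUS 3
    #define BLOCK_SIZE 256

    // Each block stages BLOCK_SIZE + 2*RADIUS input elements into shared memory,
    // so the repeated reads needed by the stencil are served by ~10-cycle shared
    // memory instead of ~200-400-cycle global memory.
    __global__ void blurKernel(const float *in, float *out, int n)
    {
        __shared__ float tile[BLOCK_SIZE + 2 * RADIUS];

        int gidx = blockIdx.x * blockDim.x + threadIdx.x;  // global index
        int lidx = threadIdx.x + RADIUS;                   // index into the tile

        // One global load per thread for the body of the tile...
        tile[lidx] = (gidx < n) ? in[gidx] : 0.0f;

        // ...and the first RADIUS threads also load the halo on each side.
        if (threadIdx.x < RADIUS) {
            int left  = gidx - RADIUS;
            int right = gidx + BLOCK_SIZE;
            tile[lidx - RADIUS]     = (left >= 0 && left < n) ? in[left]  : 0.0f;
            tile[lidx + BLOCK_SIZE] = (right < n)             ? in[right] : 0.0f;
        }
        __syncthreads();

        // 2*RADIUS+1 reads per thread, all from shared memory.
        if (gidx < n) {
            float sum = 0.0f;
            for (int i = -RADIUS; i <= RADIUS; ++i)
                sum += tile[lidx + i];
            out[gidx] = sum / (2 * RADIUS + 1);
        }
    }

A launch would look like blurKernel<<<(n + BLOCK_SIZE - 1) / BLOCK_SIZE, BLOCK_SIZE>>>(d_in, d_out, n).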
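
On note 3: the 63-register cap usually shows up as register spilling, where values that do not fit go to local memory. Here is a hedged sketch of how to see and influence per-thread register usage; the kernel is a throwaway example of mine, but the --ptxas-options=-v flag and the __launch_bounds__ qualifier are standard CUDA.

    // Ask ptxas to report per-kernel register and spill counts:
    //   nvcc -arch=sm_20 --ptxas-options=-v saxpy.cu
    //
    // __launch_bounds__ tells the compiler the maximum block size the kernel will
    // be launched with, letting it trade registers per thread against occupancy;
    // values beyond the per-thread register limit are spilled to local memory.
    __global__ void __launch_bounds__(256)
    saxpy(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

Compiling with -maxrregcount=N caps register usage for a whole file in a similar way; neither mechanism can raise the hardware limit of 63 on Fermi.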
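
On note 5: the practical consequence of the 128-byte minimum is coalescing. A warp of 32 threads reading consecutive 4-byte floats maps exactly onto one 128-byte transaction, while a strided pattern touches a different 128-byte segment per thread and wastes most of every fetch. A minimal sketch of the two patterns (kernel names are mine):

    // Coalesced: thread k of a warp reads element k, so the warp's 32 x 4-byte
    // loads fall in one 128-byte segment -- a single memory transaction.
    __global__ void copyCoalesced(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];
    }

    // Strided: adjacent threads read elements far apart, so each 4-byte load
    // lands in its own 128-byte segment and the rest of each fetch is wasted.
    __global__ void copyStrided(const float *in, float *out, int n, int stride)
    {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        if (i < n)
            out[i] = in[i];
    }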
