There is a good talk by David Kirk up on YouTube about Fermi. I found the talk quite useful, with Kirk explaining various concepts in CUDA and Fermi with better abstractions than the ones I had in mind.
A few notes from the talk:
The CPU die area is mostly cache, with a bit of execution-control logic. The GPU die is the other way around. This is because the CPU is built for a cache hit, while the GPU is built for a cache miss.
L1 or shared memory latency: ~10 cycles
Global memory latency: ~200-400 cycles
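To get a feel for what "built for a cache miss" means with these numbers, here is a rough back-of-the-envelope sketch of how many warps an SM needs in flight to cover a memory access. The model and the `busy_cycles_per_warp` parameter are my own simplification, not something from the talk: it assumes each warp does a fixed amount of independent work between memory accesses.

```python
import math

def warps_to_hide_latency(latency_cycles, busy_cycles_per_warp):
    """Warps needed so that some warp always has work while one waits on memory.

    Simplified model (an assumption, not from the talk): every warp does
    busy_cycles_per_warp cycles of independent work, then stalls for
    latency_cycles on a memory access.
    """
    # Enough other warps to fill the stall, plus the stalled warp itself.
    return math.ceil(latency_cycles / busy_cycles_per_warp) + 1

# Latencies from the notes above; 20 busy cycles per warp is an arbitrary guess.
for latency in (10, 200, 400):
    print(latency, warps_to_hide_latency(latency, busy_cycles_per_warp=20))
```

Under this toy model, a global memory access needs roughly ten times as many resident warps to hide as a shared memory access does.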
Even though a Fermi SM has 32K registers in its register file, the maximum number of registers allowed per thread is a mere 63. This constraint is due to the number of bits NVIDIA used for the addressing from a thread to its registers in the register file.
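The arithmetic behind this is easy to sketch. The 32K and 63 figures are from the talk; the 1536 resident threads per SM is the Fermi (compute capability 2.x) limit as I remember it from the programming guide, and the 6-bit register field with one reserved encoding is my assumption about how 63 comes about. The sketch also ignores register allocation granularity.

```python
REGISTERS_PER_SM = 32 * 1024   # from the talk: 32K registers per Fermi SM
MAX_REGS_PER_THREAD = 63       # from the talk
WARP_SIZE = 32
MAX_THREADS_PER_SM = 1536      # assumed Fermi limit, not from the talk

# Assumption: a 6-bit register index with one encoding reserved gives 63 names.
assert 2 ** 6 - 1 == MAX_REGS_PER_THREAD

def resident_threads(regs_per_thread):
    """Threads that fit on one SM if the register file is the only limit."""
    by_registers = REGISTERS_PER_SM // regs_per_thread
    whole_warps = (by_registers // WARP_SIZE) * WARP_SIZE  # threads run in warps
    return min(whole_warps, MAX_THREADS_PER_SM)

print(resident_threads(63))  # 512 threads, i.e. 16 warps
print(resident_threads(21))  # 1536 threads, the assumed per-SM maximum
```

So a kernel that actually uses all 63 registers per thread can keep only about a third of the SM's thread slots occupied.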
Core clock speeds remain around 1GHz, which is significantly lower than CPU clock speeds. This is intentional, because power consumption increases super-linearly with clock speed. GPUs already consume a lot of power, and it would not be good to see this increase further.
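One common way to see the super-linear growth is the textbook dynamic power relation, where power is proportional to capacitance times voltage squared times frequency. The sketch below additionally assumes, as a rough rule of thumb that is not from the talk, that supply voltage has to rise in proportion to frequency, which makes power grow with the cube of the clock.

```python
def relative_dynamic_power(freq_ratio, voltage_ratio=None):
    """Dynamic power relative to a baseline, using P ~ C * V^2 * f.

    If voltage_ratio is not given, assume voltage scales linearly with
    frequency (a rough approximation, labeled here as an assumption).
    """
    if voltage_ratio is None:
        voltage_ratio = freq_ratio
    return voltage_ratio ** 2 * freq_ratio

print(relative_dynamic_power(2.0))       # 8.0: doubling the clock, voltage tracking it
print(relative_dynamic_power(2.0, 1.0))  # 2.0: doubling the clock at fixed voltage
```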
The minimum size of a global memory read is 128 bytes. This minimum is dictated by DRAM technology or manufacturers and will only get larger every year due to the way the DRAM technology is progressing.
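The practical consequence shows up when a warp's reads are spread out in memory. Here is a small sketch that counts how many 128-byte segments a 32-thread warp touches for a given stride and what fraction of the fetched bytes is actually used. The function name and the simple model (aligned base address, 4-byte words, one fetch per distinct segment) are my own assumptions, not the exact hardware behaviour.

```python
SEGMENT_BYTES = 128  # minimum global memory read size, from the talk
WARP_SIZE = 32

def warp_read_efficiency(stride_words, word_bytes=4, base=0):
    """Return (segments fetched, useful fraction) for one warp-wide read.

    Assumes thread i reads one word at base + i * stride_words * word_bytes
    and that each distinct 128-byte segment is fetched exactly once.
    """
    addresses = [base + lane * stride_words * word_bytes for lane in range(WARP_SIZE)]
    segments = {addr // SEGMENT_BYTES for addr in addresses}
    useful_bytes = WARP_SIZE * word_bytes
    fetched_bytes = len(segments) * SEGMENT_BYTES
    return len(segments), useful_bytes / fetched_bytes

print(warp_read_efficiency(1))   # (1, 1.0): consecutive words fill one segment
print(warp_read_efficiency(2))   # (2, 0.5): half of what is fetched is wasted
print(warp_read_efficiency(32))  # (32, 0.03125): one segment per thread
```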
The reason the CUDA profiler needs to run the application multiple times is that there is only a limited number of hardware counters per SM. (Kirk says 4, but the CUDA programming guide says 8 counters.)
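As a rough illustration of why the number of runs grows, here is a tiny sketch that assumes any set of events can share the counters freely, which is a simplification of mine. The real profiler may have further restrictions on which events can be collected together.

```python
import math

def profiler_runs(num_events, counters_per_sm):
    """Application runs needed if each run can record counters_per_sm events."""
    return math.ceil(num_events / counters_per_sm)

# Ten requested events, under the two counter counts mentioned above.
print(profiler_runs(10, 4))  # 3 runs with 4 counters
print(profiler_runs(10, 8))  # 2 runs with 8 counters
```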