The natural next step for a CUDA newbie after CUDA by Example is the book Programming Massively Parallel Processors by David Kirk and Wen-mei Hwu. The authors taught the first course on CUDA at UIUC for a few semesters, and this book is based on the lecture notes from that course. I used their lecture notes and videos when I first learnt CUDA.
Writing a CUDA application that moulds the problem optimally to the CUDA architecture requires the programmer to think very differently than when programming for a CPU. Using a matrix multiplication example, the authors walk the student through many levels of improvement. Each level introduces a different facet of the architecture, and by the end the performance of the solution has improved by as much as two orders of magnitude.
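To give a flavour of one such level of improvement, here is a minimal sketch of a shared-memory tiled matrix multiplication. This is my own illustration, not the book's code: the names matmul_tiled and matmul_on_gpu and the tile width of 16 are assumptions, the matrices are assumed to be square and row-major, and most error checking is left out for brevity.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

constexpr int TILE = 16;  // assumed tile width; tune per device

// C = A * B for square n x n row-major matrices.
// Each block computes one TILE x TILE tile of C, staging tiles of A and B
// in shared memory so each global element is loaded once per tile.
__global__ void matmul_tiled(const float* A, const float* B, float* C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
        int a_col = t * TILE + threadIdx.x;
        int b_row = t * TILE + threadIdx.y;
        // Out-of-range elements are padded with zeros.
        As[threadIdx.y][threadIdx.x] =
            (row < n && a_col < n) ? A[row * n + a_col] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (b_row < n && col < n) ? B[b_row * n + col] : 0.0f;
        __syncthreads();  // the whole tile must be loaded before use

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // finish using the tile before overwriting it
    }
    if (row < n && col < n) C[row * n + col] = acc;
}

// Hypothetical host helper: copies inputs over, launches, copies C back.
std::vector<float> matmul_on_gpu(const std::vector<float>& A,
                                 const std::vector<float>& B, int n) {
    assert(A.size() == static_cast<size_t>(n) * n && B.size() == A.size());
    size_t bytes = A.size() * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, A.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
    matmul_tiled<<<grid, block>>>(dA, dB, dC, n);

    std::vector<float> C(A.size());
    cudaError_t err = cudaMemcpy(C.data(), dC, bytes, cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return C;
}
```

Compared with a naive kernel in which every thread reads a full row of A and a full column of B from global memory, each global element is read once per tile here, which cuts global memory traffic by roughly a factor of TILE.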
All the concepts of the CUDA architecture are covered: the thread-block-grid hierarchy, the global-shared-local memories and barrier synchronization. Details of the warps and the warp scheduler are explained. Since most CUDA applications are scientific, there is an entire chapter on the floating point format. This chapter gives a practitioner's perspective that I found to be more useful than the popular but obscure What Every Computer Scientist Should Know About Floating-Point Arithmetic. There are two chapters on application case studies, which are mostly useless since the reader cannot understand the applications intimately enough to draw any lessons from them.
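One practical consequence of the floating point format, which matters whenever a parallel algorithm changes the order of a summation, is that addition is not associative. The sketch below is my own illustration rather than an example from the book; the names sum_two_ways and grouped_sums are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <cassert>

// Adds the same three floats in two different groupings on the device.
__global__ void sum_two_ways(float a, float b, float c, float* out) {
    out[0] = (a + b) + c;  // group the first two terms
    out[1] = a + (b + c);  // group the last two terms
}

// Hypothetical host helper: runs the kernel and copies both sums back.
void grouped_sums(float a, float b, float c, float result[2]) {
    float* d_out = nullptr;
    cudaMalloc(&d_out, 2 * sizeof(float));
    sum_two_ways<<<1, 1>>>(a, b, c, d_out);
    cudaError_t err =
        cudaMemcpy(result, d_out, 2 * sizeof(float), cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    cudaFree(d_out);
}
```

Calling grouped_sums(1e20f, -1e20f, 1.0f, r) should give r[0] equal to 1 and r[1] equal to 0: in the second grouping the 1.0f is far smaller than the spacing between representable floats near 1e20, so it is lost when added to -1e20f.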
CUDA runs only on NVIDIA devices. OpenCL is its twin, designed to be used on all kinds of CPUs and GPUs. The authors have thrown in a chapter on OpenCL for folks who need to transition to it. OpenCL is very similar to CUDA, except that it does not have an equivalent of the CUDA Runtime API. So, the programmer ends up spending some time building the scaffolding required to run her kernels.
Programming Massively Parallel Processors is an easy book to study from. It should be accessible to any intermediate-to-expert programmer. Newbies can check out CUDA by Example before studying this book. I do wish this book covered cache configuration, launch bounds, profiling, compiler options and other intimate details which one ends up using to squeeze out the last bit of performance. Currently, I have to fall back on the CUDA Programming Guide for such information. The book is also a wee bit outdated, since the Fermi architecture is not well covered and the new Kepler architecture has already been released.
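For readers who have not met these knobs yet, here is a minimal sketch of two of them, launch bounds and cache configuration, applied to a simple SAXPY kernel. The kernel, the helper name saxpy_on_gpu and the block size of 256 are my own assumptions for illustration; __launch_bounds__ and cudaFuncSetCacheConfig are the real CUDA facilities described in the CUDA Programming Guide.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

constexpr int THREADS = 256;  // assumed block size

// __launch_bounds__ tells the compiler this kernel is never launched with
// more than THREADS threads per block, so it can budget registers for that.
__global__ void __launch_bounds__(THREADS)
saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Hypothetical host helper: returns a * x + y computed on the GPU.
std::vector<float> saxpy_on_gpu(float a, const std::vector<float>& x,
                                std::vector<float> y) {
    assert(x.size() == y.size());
    int n = static_cast<int>(x.size());
    size_t bytes = x.size() * sizeof(float);
    float *dx = nullptr, *dy = nullptr;
    cudaMalloc(&dx, bytes);
    cudaMalloc(&dy, bytes);
    cudaMemcpy(dx, x.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, y.data(), bytes, cudaMemcpyHostToDevice);

    // Ask for a larger L1 cache at the expense of shared memory. This is
    // only a preference; the runtime may use a different split.
    cudaFuncSetCacheConfig(saxpy, cudaFuncCachePreferL1);

    saxpy<<<(n + THREADS - 1) / THREADS, THREADS>>>(n, a, dx, dy);

    cudaError_t err = cudaMemcpy(y.data(), dy, bytes, cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    cudaFree(dx);
    cudaFree(dy);
    return y;
}
```

Compiling with nvcc -Xptxas -v prints the register and shared memory usage of each kernel, which is a handy way to see what effect the launch bounds actually had.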