Notes of talk: C++ in the 21st century

I recently came across a 2014 talk by Arvid Norberg about the new features in C++11. The video is here and slides are here.

C++ is huge and getting bigger every day, so I keep discovering interesting new features that I like to note down for use in my own code. Below are my notes from this talk; I do not note aspects that I already know well. The talk has examples that are small but illustrative, so if any of these features interests you, watch the video to see them.

For loops

  • std::begin and std::end work on plain C arrays too. Note that this works only when the array size is known at compile time, i.e. the array has not decayed to a pointer (for example, by being passed to a function).


decltype

  • decltype deduces the type of an expression without evaluating it, so it can be used wherever a type is expected, for example as a template argument:
// vector of the return type of function f
std::vector<decltype(f())> vals;
  • Internally, it is used by auto to deduce the type of an expression.

lambda functions

  • A lambda expression yields an unnamed function object. The tiny examples in the talk are good.


override

  • This keyword helps programmers find errors. For example, if a virtual method in the base class is not const but the derived class version is, the derived method silently does not override, and the programmer might miss this. If a method declared override in the derived class does not actually override anything, the compiler will complain.
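A minimal sketch of the const mismatch described above (my own example):

```cpp
struct Base {
    virtual int value() { return 1; }
    virtual ~Base() = default;
};

struct Derived : Base {
    // Without override, a const version would silently *hide* rather than
    // override Base::value, because the signatures differ. Declared with
    // override, the compiler rejects the mismatch:
    //
    //   int value() const override { return 2; }  // error: does not override
    //
    // The correctly overriding signature:
    int value() override { return 2; }
};
```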


unique_ptr

  • This smart pointer is not copyable, but it is movable. The owned object is deleted when the pointer goes out of scope.
  • Many functions create a heap-allocated object and return it. Traditionally, programmers had to worry about the ownership and lifetime of such a returned object. Return it as unique_ptr and forget about these worries.
  • Also great for storing such heap-allocated objects in containers.


This C++11 feature was new to me, and I did not understand how to apply it either. I might need to study it in the future.

error_code

  • error_code represents an error. It holds an integral error value indicating which error occurred, and a category indicating the domain the error value belongs to.
  • The category is an abstract base class that implements, among other things, the conversion of an error value to a human-readable message.


chrono

  • There are a whole bunch of old C, C++, Unix and POSIX time functions. They are not platform agnostic, have low time resolution, have no type safety (a milliseconds value can be passed to a function that takes microseconds, and so on) and are not monotonic. Monotonic in this context means that if you measure a time before DST is turned on and again after it, the latter value should always be larger, even though the wall clock may have been turned back by DST.
  • Chrono introduces a clock with its own epoch (start of life) and its own resolution.
  • time_point: A point in time relative to epoch. It has its resolution encoded inside it.
  • duration: The difference of two time points. It too has its resolution encoded inside it.
  • Because these types have their resolution embedded inside, two durations of different resolutions can be added together to produce a duration whose resolution is the finer of the two. A duration can also be passed to a function that accepts a different resolution, and the template machinery ensures that everything converts correctly (an explicit duration_cast is needed where the conversion would lose precision).

David Kirk on CUDA and Fermi

There is a good talk by David Kirk on YouTube about Fermi. I found the talk quite useful, with Kirk explaining various concepts in CUDA and Fermi with better abstractions than the ones I had in my mind.

A few notes from the talk:

1. The CPU die area is mostly cache, with a bit of execution-control logic. The GPU die is the other way around. This is because the CPU is built for a cache hit, while the GPU is built for a cache miss.

2. L1 or shared memory latency: ~10 cycles
Global memory latency: ~200-400 cycles

3. Even though a Fermi SM has 32K registers in its register file, the maximum number of registers allowed per thread is a mere 63. This constraint is due to the number of bits NVIDIA used for the addressing from a thread to its registers in the register file.

4. Core clock speeds remain around 1 GHz, which is significantly lower than CPU clock speeds. This is intentional, because power consumption increases super-linearly with clock speed. GPUs already consume a lot of power; it would not be good to see that increase further.

5. The minimum size of a global memory read is 128 bytes. This minimum is dictated by DRAM technology, and it will only get larger every year given the way DRAM technology is progressing.

6. The reason the CUDA profiler needs to run the application multiple times is that there are a limited number of counters per SM. (Kirk says 4, but CUDA programming guide says 8 counters.)