How to print structure of C++ unordered container

The unordered containers in C++ are implemented as hash tables with separate chaining. The standard does not specify this implementation explicitly. But, by examining the requirements of the container, this is and will be the implementation in any C++ standard library.

Given this implementation, it would be an instructive exercise to be able to print out the actual structure of the hash table, with its buckets and linked list of keys. Thankfully, this is possible since the standard provides a bucket_count method and a begin method that takes a bucket index as input.

Here is simple code that prints buckets of unordered_set and its output on my computer:

Tried with: GCC 5.1 and Ubuntu 14.04

Prime number buckets in hash table implementations

Prime numbers are very important in the implementation of hash tables. They might be used in computing a hash for a given key using a hash function. Also, they seem to be commonly used as the size of the hash table, i.e., the number of buckets in the table.

Consider, a hash table with separate chaining. When there is a collision, i.e., multiple keys map to the same bucket, they are maintained in a list at that bucket. The load factor of the hash table is the average number of keys per bucket in the table. When the load factor increases beyond a certain threshold, a new hash table with larger number of buckets is chosen, the existing keys are rehashed and inserted anew into the new hash table.

A key part of a hash table implementation is: what number of buckets to choose when increasing or decreasing the size of a hash table? In containers like vectors, this is easy: in most implementations it grows by 2x the size. But, for hash tables, I found that implementations seem to prefer prime numbers. So, how do hash table implementations pick this prime number?

Hash table in C++ STL of GCC

The containers based on hash tables in C++ STL are unordered_set, unordered_multiset, unordered_map and unordered_multimap. All of them are based on the same underlying hash table implementation. Computing an optimal prime number for a given load factor and number of keys is not easy.

Not surprisingly, the GCC 5.1 implementation of STL has a pre-computed lookup table of prime numbers. I found it in /usr/include/c++/5/ext/pb_ds/detail/resize_policy/hash_prime_size_policy_imp.hpp. Here is the array of prime numbers it uses:

static const std::size_t g_a_sizes[num_distinct_sizes_64_bit] =
{
  /* 0     */              5ul,
  /* 1     */              11ul, 
  /* 2     */              23ul, 
  /* 3     */              47ul, 
  /* 4     */              97ul, 
  /* 5     */              199ul, 
  /* 6     */              409ul, 
  /* 7     */              823ul, 
  /* 8     */              1741ul, 
  /* 9     */              3469ul, 
  /* 10    */              6949ul, 
  /* 11    */              14033ul, 
  /* 12    */              28411ul, 
  /* 13    */              57557ul, 
  /* 14    */              116731ul, 
  /* 15    */              236897ul,
  /* 16    */              480881ul, 
  /* 17    */              976369ul,
  /* 18    */              1982627ul, 
  /* 19    */              4026031ul,
  /* 20    */              8175383ul, 
  /* 21    */              16601593ul, 
  /* 22    */              33712729ul,
  /* 23    */              68460391ul, 
  /* 24    */              139022417ul, 
  /* 25    */              282312799ul, 
  /* 26    */              573292817ul, 
  /* 27    */              1164186217ul,
  /* 28    */              2364114217ul, 
  /* 29    */              4294967291ul,
  /* 30    */ (std::size_t)8589934583ull,
  /* 31    */ (std::size_t)17179869143ull,
  /* 32    */ (std::size_t)34359738337ull,
  /* 33    */ (std::size_t)68719476731ull,
  /* 34    */ (std::size_t)137438953447ull,
  /* 35    */ (std::size_t)274877906899ull,
  /* 36    */ (std::size_t)549755813881ull,
  /* 37    */ (std::size_t)1099511627689ull,
  /* 38    */ (std::size_t)2199023255531ull,
  /* 39    */ (std::size_t)4398046511093ull,
  /* 40    */ (std::size_t)8796093022151ull,
  /* 41    */ (std::size_t)17592186044399ull,
  /* 42    */ (std::size_t)35184372088777ull,
  /* 43    */ (std::size_t)70368744177643ull,
  /* 44    */ (std::size_t)140737488355213ull,
  /* 45    */ (std::size_t)281474976710597ull,
  /* 46    */ (std::size_t)562949953421231ull, 
  /* 47    */ (std::size_t)1125899906842597ull,
  /* 48    */ (std::size_t)2251799813685119ull, 
  /* 49    */ (std::size_t)4503599627370449ull,
  /* 50    */ (std::size_t)9007199254740881ull, 
  /* 51    */ (std::size_t)18014398509481951ull,
  /* 52    */ (std::size_t)36028797018963913ull, 
  /* 53    */ (std::size_t)72057594037927931ull,
  /* 54    */ (std::size_t)144115188075855859ull,
  /* 55    */ (std::size_t)288230376151711717ull,
  /* 56    */ (std::size_t)576460752303423433ull,
  /* 57    */ (std::size_t)1152921504606846883ull,
  /* 58    */ (std::size_t)2305843009213693951ull,
  /* 59    */ (std::size_t)4611686018427387847ull,
  /* 60    */ (std::size_t)9223372036854775783ull,
  /* 61    */ (std::size_t)18446744073709551557ull,
};

You can create a simple C++ program that inserts keys into an unordered_set and then check the number of buckets using the bucket_count method. You will find that it will be one of the above listed prime numbers.

Hash table in .Net (C#)

Now that the .Net source code is available, I also checked out its System.Collections.HashTable implementation. It too seems to be using a lookup table of prime numbers for the table size. Here is the list from its source code:

// Table of prime numbers to use as hash table sizes. 
// A typical resize algorithm would pick the smallest prime number in this array
// that is larger than twice the previous capacity. 
// Suppose our Hashtable currently has capacity x and enough elements are added 
// such that a resize needs to occur. Resizing first computes 2x then finds the 
// first prime in the table greater than 2x, i.e. if primes are ordered 
// p_1, p_2, ..., p_i, ..., it finds p_n such that p_n-1 < 2x < p_n. 
// Doubling is important for preserving the asymptotic complexity of the 
// hashtable operations such as add.  Having a prime guarantees that double 
// hashing does not lead to infinite loops.  IE, your hash function will be 
// h1(key) + i*h2(key), 0 <= i < size.  h2 and the size must be relatively prime.
public static readonly int[] primes = {
    3, 7, 11, 17, 23, 29, 37, 47, 59, 71, 89, 107, 131, 163, 197, 239, 293, 353, 431, 521, 631, 761, 919,
    1103, 1327, 1597, 1931, 2333, 2801, 3371, 4049, 4861, 5839, 7013, 8419, 10103, 12143, 14591,
    17519, 21023, 25229, 30293, 36353, 43627, 52361, 62851, 75431, 90523, 108631, 130363, 156437,
    187751, 225307, 270371, 324449, 389357, 467237, 560689, 672827, 807403, 968897, 1162687, 1395263,
    1674319, 2009191, 2411033, 2893249, 3471899, 4166287, 4999559, 5999471, 7199369};

The prime numbers in .Net are different from that in GCC STL. I’m guessing they are tuned for the .Net load factors, languages and virtual machine.

A conclusion we can draw from these observations is that though hash tables easily beat balanced binary search tree (BST) in lookup performance, they are not any easier to implement. Especially a hash table that is built for general purpose applications and data sizes.