Prime number buckets in hash table implementations

Prime numbers are very important in the implementation of hash tables. They might be used in computing a hash for a given key using a hash function. Also, they seem to be commonly used as the size of the hash table, i.e., the number of buckets in the table.

Consider, a hash table with separate chaining. When there is a collision, i.e., multiple keys map to the same bucket, they are maintained in a list at that bucket. The load factor of the hash table is the average number of keys per bucket in the table. When the load factor increases beyond a certain threshold, a new hash table with larger number of buckets is chosen, the existing keys are rehashed and inserted anew into the new hash table.

A key part of a hash table implementation is: what number of buckets to choose when increasing or decreasing the size of a hash table? In containers like vectors, this is easy: in most implementations it grows by 2x the size. But, for hash tables, I found that implementations seem to prefer prime numbers. So, how do hash table implementations pick this prime number?

Hash table in C++ STL of GCC

The containers based on hash tables in C++ STL are unordered_set, unordered_multiset, unordered_map and unordered_multimap. All of them are based on the same underlying hash table implementation. Computing an optimal prime number for a given load factor and number of keys is not easy.

Not surprisingly, the GCC 5.1 implementation of STL has a pre-computed lookup table of prime numbers. I found it in /usr/include/c++/5/ext/pb_ds/detail/resize_policy/hash_prime_size_policy_imp.hpp. Here is the array of prime numbers it uses:

static const std::size_t g_a_sizes[num_distinct_sizes_64_bit] =
{
  /* 0     */              5ul,
  /* 1     */              11ul, 
  /* 2     */              23ul, 
  /* 3     */              47ul, 
  /* 4     */              97ul, 
  /* 5     */              199ul, 
  /* 6     */              409ul, 
  /* 7     */              823ul, 
  /* 8     */              1741ul, 
  /* 9     */              3469ul, 
  /* 10    */              6949ul, 
  /* 11    */              14033ul, 
  /* 12    */              28411ul, 
  /* 13    */              57557ul, 
  /* 14    */              116731ul, 
  /* 15    */              236897ul,
  /* 16    */              480881ul, 
  /* 17    */              976369ul,
  /* 18    */              1982627ul, 
  /* 19    */              4026031ul,
  /* 20    */              8175383ul, 
  /* 21    */              16601593ul, 
  /* 22    */              33712729ul,
  /* 23    */              68460391ul, 
  /* 24    */              139022417ul, 
  /* 25    */              282312799ul, 
  /* 26    */              573292817ul, 
  /* 27    */              1164186217ul,
  /* 28    */              2364114217ul, 
  /* 29    */              4294967291ul,
  /* 30    */ (std::size_t)8589934583ull,
  /* 31    */ (std::size_t)17179869143ull,
  /* 32    */ (std::size_t)34359738337ull,
  /* 33    */ (std::size_t)68719476731ull,
  /* 34    */ (std::size_t)137438953447ull,
  /* 35    */ (std::size_t)274877906899ull,
  /* 36    */ (std::size_t)549755813881ull,
  /* 37    */ (std::size_t)1099511627689ull,
  /* 38    */ (std::size_t)2199023255531ull,
  /* 39    */ (std::size_t)4398046511093ull,
  /* 40    */ (std::size_t)8796093022151ull,
  /* 41    */ (std::size_t)17592186044399ull,
  /* 42    */ (std::size_t)35184372088777ull,
  /* 43    */ (std::size_t)70368744177643ull,
  /* 44    */ (std::size_t)140737488355213ull,
  /* 45    */ (std::size_t)281474976710597ull,
  /* 46    */ (std::size_t)562949953421231ull, 
  /* 47    */ (std::size_t)1125899906842597ull,
  /* 48    */ (std::size_t)2251799813685119ull, 
  /* 49    */ (std::size_t)4503599627370449ull,
  /* 50    */ (std::size_t)9007199254740881ull, 
  /* 51    */ (std::size_t)18014398509481951ull,
  /* 52    */ (std::size_t)36028797018963913ull, 
  /* 53    */ (std::size_t)72057594037927931ull,
  /* 54    */ (std::size_t)144115188075855859ull,
  /* 55    */ (std::size_t)288230376151711717ull,
  /* 56    */ (std::size_t)576460752303423433ull,
  /* 57    */ (std::size_t)1152921504606846883ull,
  /* 58    */ (std::size_t)2305843009213693951ull,
  /* 59    */ (std::size_t)4611686018427387847ull,
  /* 60    */ (std::size_t)9223372036854775783ull,
  /* 61    */ (std::size_t)18446744073709551557ull,
};

You can create a simple C++ program that inserts keys into an unordered_set and then check the number of buckets using the bucket_count method. You will find that it will be one of the above listed prime numbers.

Hash table in .Net (C#)

Now that the .Net source code is available, I also checked out its System.Collections.HashTable implementation. It too seems to be using a lookup table of prime numbers for the table size. Here is the list from its source code:

// Table of prime numbers to use as hash table sizes. 
// A typical resize algorithm would pick the smallest prime number in this array
// that is larger than twice the previous capacity. 
// Suppose our Hashtable currently has capacity x and enough elements are added 
// such that a resize needs to occur. Resizing first computes 2x then finds the 
// first prime in the table greater than 2x, i.e. if primes are ordered 
// p_1, p_2, ..., p_i, ..., it finds p_n such that p_n-1 < 2x < p_n. 
// Doubling is important for preserving the asymptotic complexity of the 
// hashtable operations such as add.  Having a prime guarantees that double 
// hashing does not lead to infinite loops.  IE, your hash function will be 
// h1(key) + i*h2(key), 0 <= i < size.  h2 and the size must be relatively prime.
public static readonly int[] primes = {
    3, 7, 11, 17, 23, 29, 37, 47, 59, 71, 89, 107, 131, 163, 197, 239, 293, 353, 431, 521, 631, 761, 919,
    1103, 1327, 1597, 1931, 2333, 2801, 3371, 4049, 4861, 5839, 7013, 8419, 10103, 12143, 14591,
    17519, 21023, 25229, 30293, 36353, 43627, 52361, 62851, 75431, 90523, 108631, 130363, 156437,
    187751, 225307, 270371, 324449, 389357, 467237, 560689, 672827, 807403, 968897, 1162687, 1395263,
    1674319, 2009191, 2411033, 2893249, 3471899, 4166287, 4999559, 5999471, 7199369};

The prime numbers in .Net are different from that in GCC STL. I’m guessing they are tuned for the .Net load factors, languages and virtual machine.

A conclusion we can draw from these observations is that though hash tables easily beat balanced binary search tree (BST) in lookup performance, they are not any easier to implement. Especially a hash table that is built for general purpose applications and data sizes.

How to enable pretty printing for STL in GDB

Printing a C++ STL container in GDB produces information about the internal structure of the container that is hard to understand:

(gdb) print foo_int_vector
$1 = {<std::_Vector_base<int, std::allocator<int> >> = {_M_impl = {<std::allocator<int>> = {<__gnu_cxx::new_allocator<int>> = {<No data fields>}, <No data fields>}, _M_start = 0x603010, _M_finish = 0x60301c, 
      _M_end_of_storage = 0x60301c}}, <No data fields>}

By adding pretty printing support, we can get information about the STL container that is easy to understand:

(gdb) print foo_int_vector
$1 = std::vector of length 3, capacity 3 = {34, 999, 345}

To enable pretty printing is easy:

  • Ensure that the GDB version is 7.0 or later. This means it has support for Python scripting.

  • Check out the Python module to perform pretty printing of STL containers:

$ svn co svn://gcc.gnu.org/svn/gcc/trunk/libstdc++-v3/python stlprettyprinter
  • The GDB that I had in Ubuntu was compiled to run Python 3.4. The Python file printers.py downloaded in the above step gave errors with it. I found an alternate printers.py from here that worked.

  • Open ~/.gdbinit and add these lines:

python
import sys
sys.path.insert(0, '/home/joe/stlprettyprinter')
from libstdcxx.v6.printers import register_libstdcxx_printers
register_libstdcxx_printers (None)
end
  • That is it, this gave me pretty printing of STL containers as shown above! 🙂

References:

Tried with: GDB 7.7 and Ubuntu 14.04

C++: List

A STL list in C++ provides the features of a doubly-linked list. Using the list is similar to using any other STL container, with iterators. Here is a simple example that illustrates the common list operations:

#include <list>
using namespace std;

typedef list<int>               IntList;
typedef IntList::iterator       IListIter;
typedef IntList::const_iterator	IListCIter;

int main()
{
    IntList ilist;                                  // Empty list
    IntList ilist1( 100 );                          // List with 100 elements of 0
    IntList ilist2( 100, 3 );                       // List with 100 elements of 3
    IntList ilist3( ilist2.begin(), ilist2.end() ); // List copied from previous list

    // Create list: 10->20->30
    ilist.push_back( 10 );
    ilist.push_back( 20 );
    ilist.push_back( 30 );

    ilist.size();  // 3
    ilist.empty(); // false

    ilist.front(); // 10
    ilist.back();  // 30

    ilist.insert( ilist.begin(), 99 );          // 99 added, 99->10->20->30
    ilist.erase( ilist.begin() );               // 99 erased, 10->20->30
    ilist.erase( ilist.begin(), ilist.end() );	// Entire list erased

    return 0;
}

C++: STL Stack

The C++ STL provides a simple stack. This is not a STL container, but is rather called a container adapter. That is, it is a simple interface built upon a real STL container.

Using STL stack is super-easy. Push elements using push(). To get the top element call top() and to pop the top element (without returning it) use pop(). So, typically top() and pop() would be used together. empty() is used to check if stack is empty.

#include <stack>
using namespace std;

stack<int> istack;
istack.push( 10 );
int i = istack.top(); // 10
istack.pop();
bool isEmpty = istack.empty(); // true

Visual C++: Iterator Checks and Slow STL Code

Visual C++ turns on a lot of checking on STL iterators when compiled in Debug mode. This is very helpful since any mistake with an iterator leads immediately to an exception with a helpful exception message. This can catch latent bugs much much earlier while programming.

However, this huge amount of security is sometimes not tolerable by everyone. For certain applications the debug mode with all the iterator checking turned on can be extremely slow, to the point of being useless. I faced a similar problem and so had to investigate to find the reason for the slow behaviour. I discovered that there are 2 mutually exclusive checking mechanisms in the STL of Visual C++: checked iterators and iterator debugging.

Checked Iterators

Checked iterators is controlled by the preprocessor definition _SECURE_SCL. SCL is Microsoft jargon for Standard C++ Library. So, this is a feature by Microsoft to provide some amount of minimal security on the usage of STL iterators. The overhead of _SECURE_SCL is so low that by default it is ON for both Release and Debug modes. So, _SECURE_SCL is rarely the cause of your program being slow. It should be left at its default options.

Iterator Debugging

Iterator debugging is a whole another beast. It involves a lot more intensive checking on the validity of iterators. It does this both when iterators are dereferenced and when containers change internally. This was implemented by Dinkumware, who are the creators of the STL implementation that Visual C++ ships with.

By default, iterator debugging is turned ON for Debug mode and OFF for Release mode. Typically it is not slow for Debug mode, but if the C++ code has a lot of loops over STL containers and uses iterators heavily, then it can make everything very slow. How slow can it get? This is the difference I saw in one of my programs compiled in Debug mode:

Iterator Debugging ON: 51.83 sec
Iterator Debugging OFF: 0.27 sec

Yes, iterator debugging was a jaw-dropping 192 times slower! Not just that, at ~52 seconds of execution, this program was unusable for my purpose!

Iterator debugging is controlled by the preprocessor definition _HAS_ITERATOR_DEBUGGING. To turn off iterator debugging, add _HAS_ITERATOR_DEBUGGING=0 to the Preprocessor Definitions in C++ → Preprocessor section of the project properties.

References:

  • STL Iterator Debugging and Secure SCL video on Channel 9 by Stephan T. Lavavej. A must watch video for anyone using STL on Visual C++. Stephan explains both Secure SCL and Iterator Debugging and how they are implemented internally in the library. Also explained is when and which of these options to turn ON or OFF.

C++ STL Vector Growth Factor

The STL vector is a dynamic array. When it grows its internal memory is reallocated and the old content is copied over to the new space. Because of this, when the vector capacity increases the iterators, references and pointers to its elements become invalid.

The capacity increase factor of a vector depends on the STL implementation. The current capacity of a vector can be found by calling its capacity() method. By running a simple program (like below), one can figure out the factor. I found that this factor is 1.5 for Visual C++ 2010, which uses the STL implementation from Dinkumware and it is 2.0 on G++ 3.4.4 running under Cygwin.

#include 
#include 
using namespace std;

int main()
{
	vector ivec;
	for ( int i = 0; i < 1000; ++i )
	{
		const int before = ivec.capacity();
		ivec.push_back( i );
		const int after = ivec.capacity();

		if ( before != after )
			cout << "Capacity: " << after << endl;
	}

	return 0;
}

C++ STL: Push Back Array into Container

Pushing the elements of an array into the back of a STL container, in a single statement, is easy using the STL std::copy algorithm. This algorithm relies on the destination having enough space to hold the source elements. So, a convenient way to apply it on a container without allocating the required space is to use a std::back_inserter. This works on all containers that have a push_back method: vector, list and deque.

#include <algorithm>
#include <iterator>
std::copy( fooArray, fooArray + ARRAY_SIZE, std::back_inserter( fooVector ) );