**Prime numbers** play an important role in the implementation of **hash tables**. They may be used by the **hash function** that computes a **hash** for a given **key**. They are also commonly used as the size of the hash table, i.e., the number of **buckets** in the table.

Consider a hash table with **separate chaining**. When there is a **collision**, i.e., multiple keys map to the same bucket, the keys are kept in a list at that bucket. The **load factor** of the hash table is the average number of keys per bucket. When the load factor rises above a certain threshold, a new table with a larger number of buckets is allocated, and the existing keys are rehashed and inserted into it.

A key part of a hash table implementation is: what number of buckets to choose when increasing or decreasing the size of the table? For containers like vectors, this is easy: most implementations grow the capacity by a constant factor, typically 2x. For hash tables, however, I found that implementations seem to prefer prime numbers. So, how do hash table implementations pick this prime number?

# Hash table in C++ STL of GCC

The containers based on hash tables in the C++ STL are `unordered_set`, `unordered_multiset`, `unordered_map` and `unordered_multimap`. All of them are based on the same underlying hash table implementation. Computing an optimal prime number for a given load factor and number of keys is not easy.

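Not surprisingly, the GCC 5.1 implementation of the STL ships a pre-computed **lookup table of prime numbers**. I found it in `/usr/include/c++/5/ext/pb_ds/detail/resize_policy/hash_prime_size_policy_imp.hpp`. Note that this header belongs to the `pb_ds` (policy-based data structures) extension; as far as I can tell, the `unordered_*` containers use a separate but similar prime table that is compiled into the library itself. Here is the array of prime numbers from the header: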

    static const std::size_t g_a_sizes[num_distinct_sizes_64_bit] =
    {
      /* 0  */ 5ul,
      /* 1  */ 11ul,
      /* 2  */ 23ul,
      /* 3  */ 47ul,
      /* 4  */ 97ul,
      /* 5  */ 199ul,
      /* 6  */ 409ul,
      /* 7  */ 823ul,
      /* 8  */ 1741ul,
      /* 9  */ 3469ul,
      /* 10 */ 6949ul,
      /* 11 */ 14033ul,
      /* 12 */ 28411ul,
      /* 13 */ 57557ul,
      /* 14 */ 116731ul,
      /* 15 */ 236897ul,
      /* 16 */ 480881ul,
      /* 17 */ 976369ul,
      /* 18 */ 1982627ul,
      /* 19 */ 4026031ul,
      /* 20 */ 8175383ul,
      /* 21 */ 16601593ul,
      /* 22 */ 33712729ul,
      /* 23 */ 68460391ul,
      /* 24 */ 139022417ul,
      /* 25 */ 282312799ul,
      /* 26 */ 573292817ul,
      /* 27 */ 1164186217ul,
      /* 28 */ 2364114217ul,
      /* 29 */ 4294967291ul,
      /* 30 */ (std::size_t)8589934583ull,
      /* 31 */ (std::size_t)17179869143ull,
      /* 32 */ (std::size_t)34359738337ull,
      /* 33 */ (std::size_t)68719476731ull,
      /* 34 */ (std::size_t)137438953447ull,
      /* 35 */ (std::size_t)274877906899ull,
      /* 36 */ (std::size_t)549755813881ull,
      /* 37 */ (std::size_t)1099511627689ull,
      /* 38 */ (std::size_t)2199023255531ull,
      /* 39 */ (std::size_t)4398046511093ull,
      /* 40 */ (std::size_t)8796093022151ull,
      /* 41 */ (std::size_t)17592186044399ull,
      /* 42 */ (std::size_t)35184372088777ull,
      /* 43 */ (std::size_t)70368744177643ull,
      /* 44 */ (std::size_t)140737488355213ull,
      /* 45 */ (std::size_t)281474976710597ull,
      /* 46 */ (std::size_t)562949953421231ull,
      /* 47 */ (std::size_t)1125899906842597ull,
      /* 48 */ (std::size_t)2251799813685119ull,
      /* 49 */ (std::size_t)4503599627370449ull,
      /* 50 */ (std::size_t)9007199254740881ull,
      /* 51 */ (std::size_t)18014398509481951ull,
      /* 52 */ (std::size_t)36028797018963913ull,
      /* 53 */ (std::size_t)72057594037927931ull,
      /* 54 */ (std::size_t)144115188075855859ull,
      /* 55 */ (std::size_t)288230376151711717ull,
      /* 56 */ (std::size_t)576460752303423433ull,
      /* 57 */ (std::size_t)1152921504606846883ull,
      /* 58 */ (std::size_t)2305843009213693951ull,
      /* 59 */ (std::size_t)4611686018427387847ull,
      /* 60 */ (std::size_t)9223372036854775783ull,
      /* 61 */ (std::size_t)18446744073709551557ull,
    };

You can create a simple C++ program that inserts keys into an `unordered_set` and then checks the number of buckets using the `bucket_count` method. On libstdc++ you should find that, once keys have been inserted, it is a prime number, though not necessarily one from the list above.

# Hash table in .NET (C#)

Now that the **.NET** source code is available, I also checked out its `System.Collections.Hashtable` implementation. It too uses a lookup table of prime numbers for the table size. Here is the list from its source code:

    // Table of prime numbers to use as hash table sizes.
    // A typical resize algorithm would pick the smallest prime number in this array
    // that is larger than twice the previous capacity.
    // Suppose our Hashtable currently has capacity x and enough elements are added
    // such that a resize needs to occur. Resizing first computes 2x then finds the
    // first prime in the table greater than 2x, i.e. if primes are ordered
    // p_1, p_2, ..., p_i, ..., it finds p_n such that p_n-1 < 2x < p_n.
    // Doubling is important for preserving the asymptotic complexity of the
    // hashtable operations such as add. Having a prime guarantees that double
    // hashing does not lead to infinite loops. IE, your hash function will be
    // h1(key) + i*h2(key), 0 <= i < size. h2 and the size must be relatively prime.
    public static readonly int[] primes = {
        3, 7, 11, 17, 23, 29, 37, 47, 59, 71, 89, 107, 131, 163, 197, 239, 293, 353,
        431, 521, 631, 761, 919, 1103, 1327, 1597, 1931, 2333, 2801, 3371, 4049, 4861,
        5839, 7013, 8419, 10103, 12143, 14591, 17519, 21023, 25229, 30293, 36353,
        43627, 52361, 62851, 75431, 90523, 108631, 130363, 156437, 187751, 225307,
        270371, 324449, 389357, 467237, 560689, 672827, 807403, 968897, 1162687,
        1395263, 1674319, 2009191, 2411033, 2893249, 3471899, 4166287, 4999559,
        5999471, 7199369};

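The prime numbers in .NET are different from those in the GCC STL. The spacing differs too: the GCC table roughly doubles from one entry to the next, while the larger entries of the .NET table grow by only about 20%. I’m guessing they are tuned for the .NET load factors, languages and virtual machine.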

One conclusion we can draw from these observations is that although hash tables easily beat a **balanced binary search tree (BST)** in lookup performance, they are not any easier to implement, especially when the hash table is built for general-purpose applications and data sizes.