2015-Oct-16 ⬩ Ashwin Nanjappa ⬩ cpp, dotnet, gcc, hash table, prime number, stl ⬩ Archive
Prime numbers are very important in the implementation of hash tables. They might be used in computing a hash for a given key using a hash function. Also, they seem to be commonly used as the size of the hash table, i.e., the number of buckets in the table.
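One commonly cited reason for prime bucket counts is that keys following a regular pattern spread poorly when the bucket count shares a factor with the pattern's stride. Here is a minimal sketch of that effect, assuming an identity hash. The helper count_used_buckets is a hypothetical name for illustration and is not taken from any library.

```cpp
#include <cstddef>
#include <set>

// Hypothetical helper: takes the keys 0, stride, 2*stride, ... (num_keys of
// them), hashes each with an identity hash, and returns how many distinct
// buckets they land in.
std::size_t count_used_buckets(std::size_t num_keys, std::size_t stride,
                               std::size_t bucket_count)
{
    std::set<std::size_t> used;
    for (std::size_t i = 0; i < num_keys; ++i)
        used.insert((i * stride) % bucket_count);
    return used.size();
}
```

With 100 keys at a stride of 8, count_used_buckets(100, 8, 16) gives 2 because every key lands in bucket 0 or 8, while count_used_buckets(100, 8, 17) gives 17 because the prime bucket count lets the keys reach every bucket.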
Consider a hash table with separate chaining. When there is a collision, i.e., multiple keys map to the same bucket, the keys are maintained in a list at that bucket. The load factor of the hash table is the average number of keys per bucket in the table. When the load factor increases beyond a certain threshold, a new hash table with a larger number of buckets is created, and the existing keys are rehashed and inserted anew into the new hash table.
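The scheme described above can be sketched in a few lines of C++. This is a minimal illustration, not how any particular library does it. The class name ChainedSet, the default of 5 buckets, the load threshold of 1.0 and the 2n + 1 growth rule are all assumptions chosen to keep the example short.

```cpp
#include <cstddef>
#include <functional>
#include <list>
#include <vector>

// Minimal separate-chaining hash set of ints (illustrative sketch only).
class ChainedSet
{
public:
    explicit ChainedSet(std::size_t bucket_count = 5, double max_load = 1.0)
        : buckets_(bucket_count ? bucket_count : 1), max_load_(max_load) {}

    // Returns false if the key was already present.
    bool insert(int key)
    {
        if (contains(key))
            return false;
        // Grow before the load factor would exceed the threshold.
        // 2n + 1 is a placeholder rule; real implementations pick a prime.
        if (static_cast<double>(size_ + 1) / buckets_.size() > max_load_)
            rehash(2 * buckets_.size() + 1);
        buckets_[index_of(key, buckets_.size())].push_back(key);
        ++size_;
        return true;
    }

    bool contains(int key) const
    {
        for (int k : buckets_[index_of(key, buckets_.size())])
            if (k == key)
                return true;
        return false;
    }

    std::size_t size() const { return size_; }
    std::size_t bucket_count() const { return buckets_.size(); }
    double load_factor() const
    {
        return static_cast<double>(size_) / buckets_.size();
    }

private:
    static std::size_t index_of(int key, std::size_t n)
    {
        return std::hash<int>{}(key) % n;
    }

    // Allocate a new bucket array and reinsert every existing key.
    void rehash(std::size_t new_count)
    {
        std::vector<std::list<int>> fresh(new_count);
        for (const auto& chain : buckets_)
            for (int k : chain)
                fresh[index_of(k, new_count)].push_back(k);
        buckets_.swap(fresh);
    }

    std::vector<std::list<int>> buckets_;
    std::size_t size_ = 0;
    double max_load_;
};
```

Inserting keys one by one and watching bucket_count() shows the table growing each time the load factor would cross the threshold. The interesting design question is what to pass to rehash.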
A key part of a hash table implementation is deciding what number of buckets to choose when increasing or decreasing the size of the table. In containers like vectors, this is easy: most implementations grow the capacity by a constant factor, commonly 2x. But for hash tables, I found that implementations seem to prefer prime numbers. So, how do hash table implementations pick this prime number?
The containers based on hash tables in C++ STL are unordered_set, unordered_multiset, unordered_map and unordered_multimap. All of them are based on the same underlying hash table implementation. Computing an optimal prime number for a given load factor and number of keys is not easy.
Not surprisingly, the GCC 5.1 implementation of STL has a pre-computed lookup table of prime numbers. I found it in /usr/include/c++/5/ext/pb_ds/detail/resize_policy/hash_prime_size_policy_imp.hpp. Here is the array of prime numbers it uses:
static const std::size_t g_a_sizes[num_distinct_sizes_64_bit] =
{
/* 0 */ 5ul,
/* 1 */ 11ul,
/* 2 */ 23ul,
/* 3 */ 47ul,
/* 4 */ 97ul,
/* 5 */ 199ul,
/* 6 */ 409ul,
/* 7 */ 823ul,
/* 8 */ 1741ul,
/* 9 */ 3469ul,
/* 10 */ 6949ul,
/* 11 */ 14033ul,
/* 12 */ 28411ul,
/* 13 */ 57557ul,
/* 14 */ 116731ul,
/* 15 */ 236897ul,
/* 16 */ 480881ul,
/* 17 */ 976369ul,
/* 18 */ 1982627ul,
/* 19 */ 4026031ul,
/* 20 */ 8175383ul,
/* 21 */ 16601593ul,
/* 22 */ 33712729ul,
/* 23 */ 68460391ul,
/* 24 */ 139022417ul,
/* 25 */ 282312799ul,
/* 26 */ 573292817ul,
/* 27 */ 1164186217ul,
/* 28 */ 2364114217ul,
/* 29 */ 4294967291ul,
/* 30 */ (std::size_t)8589934583ull,
/* 31 */ (std::size_t)17179869143ull,
/* 32 */ (std::size_t)34359738337ull,
/* 33 */ (std::size_t)68719476731ull,
/* 34 */ (std::size_t)137438953447ull,
/* 35 */ (std::size_t)274877906899ull,
/* 36 */ (std::size_t)549755813881ull,
/* 37 */ (std::size_t)1099511627689ull,
/* 38 */ (std::size_t)2199023255531ull,
/* 39 */ (std::size_t)4398046511093ull,
/* 40 */ (std::size_t)8796093022151ull,
/* 41 */ (std::size_t)17592186044399ull,
/* 42 */ (std::size_t)35184372088777ull,
/* 43 */ (std::size_t)70368744177643ull,
/* 44 */ (std::size_t)140737488355213ull,
/* 45 */ (std::size_t)281474976710597ull,
/* 46 */ (std::size_t)562949953421231ull,
/* 47 */ (std::size_t)1125899906842597ull,
/* 48 */ (std::size_t)2251799813685119ull,
/* 49 */ (std::size_t)4503599627370449ull,
/* 50 */ (std::size_t)9007199254740881ull,
/* 51 */ (std::size_t)18014398509481951ull,
/* 52 */ (std::size_t)36028797018963913ull,
/* 53 */ (std::size_t)72057594037927931ull,
/* 54 */ (std::size_t)144115188075855859ull,
/* 55 */ (std::size_t)288230376151711717ull,
/* 56 */ (std::size_t)576460752303423433ull,
/* 57 */ (std::size_t)1152921504606846883ull,
/* 58 */ (std::size_t)2305843009213693951ull,
/* 59 */ (std::size_t)4611686018427387847ull,
/* 60 */ (std::size_t)9223372036854775783ull,
/* 61 */ (std::size_t)18446744073709551557ull,
};
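One plausible way to use such a sorted table is a binary search for the smallest entry that is at least the number of buckets wanted. The sketch below works under that assumption and is not the pb_ds code itself. The array k_sizes holds only the first ten entries from the list above, and pick_table_size is a hypothetical name.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>

// First ten entries of the table above, for illustration.
static const std::size_t k_sizes[] = {5, 11, 23, 47, 97, 199, 409, 823, 1741, 3469};

// Hypothetical helper: smallest listed prime >= wanted, or the largest
// listed prime if wanted exceeds every entry in this excerpt.
std::size_t pick_table_size(std::size_t wanted)
{
    const std::size_t* first = std::begin(k_sizes);
    const std::size_t* last = std::end(k_sizes);
    const std::size_t* it = std::lower_bound(first, last, wanted);
    return (it == last) ? *(last - 1) : *it;
}
```

For example, pick_table_size(100) returns 199, the first listed prime that is not below 100. Because each entry is roughly double the previous one, stepping to the next entry approximately doubles the table.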
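You can create a simple C++ program that inserts keys into an unordered_set and then checks the number of buckets using the bucket_count method. You will find that it is a prime number. Note that the header above belongs to the pb_ds (policy-based data structures) extension, while unordered_set takes its bucket counts from a separate prime table inside libstdc++, so the values may not match the list above at every size.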
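The check described above can be sketched as follows. The function name observed_bucket_counts is a hypothetical helper. The exact values it returns depend on the standard library and its version, so no particular output is assumed here.

```cpp
#include <cstddef>
#include <unordered_set>
#include <vector>

// Inserts the keys 0 .. num_keys-1 and records each distinct bucket count
// observed along the way, in order of appearance.
std::vector<std::size_t> observed_bucket_counts(int num_keys)
{
    std::unordered_set<int> s;
    std::vector<std::size_t> counts;
    counts.push_back(s.bucket_count());
    for (int i = 0; i < num_keys; ++i)
    {
        s.insert(i);
        // Record the bucket count only when a rehash has changed it.
        if (s.bucket_count() != counts.back())
            counts.push_back(s.bucket_count());
    }
    return counts;
}
```

Printing the result of observed_bucket_counts(10000) shows the sequence of table sizes the library stepped through as the set grew.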
Now that the .NET source code is available, I also checked out its System.Collections.Hashtable implementation. It too seems to be using a lookup table of prime numbers for the table size. Here is the list from its source code:
// Table of prime numbers to use as hash table sizes.
// A typical resize algorithm would pick the smallest prime number in this array
// that is larger than twice the previous capacity.
// Suppose our Hashtable currently has capacity x and enough elements are added
// such that a resize needs to occur. Resizing first computes 2x then finds the
// first prime in the table greater than 2x, i.e. if primes are ordered
// p_1, p_2, ..., p_i, ..., it finds p_n such that p_n-1 < 2x < p_n.
// Doubling is important for preserving the asymptotic complexity of the
// hashtable operations such as add. Having a prime guarantees that double
// hashing does not lead to infinite loops. IE, your hash function will be
// h1(key) + i*h2(key), 0 <= i < size. h2 and the size must be relatively prime.
public static readonly int[] primes = {
3, 7, 11, 17, 23, 29, 37, 47, 59, 71, 89, 107, 131, 163, 197, 239, 293, 353, 431, 521, 631, 761, 919,
1103, 1327, 1597, 1931, 2333, 2801, 3371, 4049, 4861, 5839, 7013, 8419, 10103, 12143, 14591,
17519, 21023, 25229, 30293, 36353, 43627, 52361, 62851, 75431, 90523, 108631, 130363, 156437,
187751, 225307, 270371, 324449, 389357, 467237, 560689, 672827, 807403, 968897, 1162687, 1395263,
1674319, 2009191, 2411033, 2893249, 3471899, 4166287, 4999559, 5999471, 7199369};
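The comment above describes two ideas: picking the first listed prime greater than twice the current capacity, and relying on a prime size so that a double-hashing probe visits every slot. Here is a minimal C++ sketch of both. It uses only the first dozen primes from the table, and next_capacity and slots_visited are hypothetical names rather than anything from the original source.

```cpp
#include <cstddef>
#include <vector>

// First few entries of the table above, for illustration.
static const int k_primes[] = {3, 7, 11, 17, 23, 29, 37, 47, 59, 71, 89, 107};

// Hypothetical helper: first listed prime greater than twice the current
// capacity, or -1 if this excerpt has no such entry.
int next_capacity(int capacity)
{
    for (int p : k_primes)
        if (p > 2 * capacity)
            return p;
    return -1;
}

// Counts the distinct slots visited by the probe sequence
// (h1 + i*h2) % size for 0 <= i < size.
std::size_t slots_visited(std::size_t h1, std::size_t h2, std::size_t size)
{
    std::vector<bool> seen(size, false);
    std::size_t distinct = 0;
    for (std::size_t i = 0; i < size; ++i)
    {
        std::size_t slot = (h1 + i * h2) % size;
        if (!seen[slot])
        {
            seen[slot] = true;
            ++distinct;
        }
    }
    return distinct;
}
```

Here next_capacity(11) returns 23. For the probe, slots_visited(0, 4, 12) is 3, since 4 and 12 share a factor and the probe keeps cycling over the same three slots, while slots_visited(0, 4, 11) is 11. With a prime size, any step between 1 and size - 1 is relatively prime to the size, so the probe reaches every slot.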
The prime numbers in .NET are different from those in the GCC STL. I'm guessing they are tuned for the .NET load factors, languages and virtual machine.
A conclusion we can draw from these observations is that though hash tables easily beat balanced binary search trees (BSTs) in lookup performance, they are not any easier to implement. This is especially true of a hash table that is built for general-purpose applications and data sizes.