During a Udacity CS101 class, Prof. David Evans introduces Python dictionaries and mentions that the viewer can learn more about the implementation of this data type in the book Beautiful Code. I looked it up and Chapter 18 of the book titled Python’s Dictionary Implementation: Being All Things to All People written by Andrew Kuchling actually turned out to be quite interesting.
Here are some notes about the dictionary implementation:
- Dictionary is used heavily in Python:
- In Python programs, dictionaries are the most popular data type after lists.
- Python’s implementation of classes uses dictionary underneath to store the class attributes.
- Keyword arguments to functions are implemented using dictionary. So, a dictionary is created and destroyed on each such function call.
- Due to such popular and diverse uses, the dictionary needs to offer fast lookup and insertion, fast creation and destruction and should be conservative on memory use.
- In CPython, dictionary is implemented as a hash table. The number of slots in the table is a power-of-2.
- Each slot has three items: pointer to the key, pointer to the value and a copy of the hash of the key. The hash is cached for fast comparison, instead of having to touch the key (which may be large) and compute hash from that.
- Unlike a typical hash table, the hash of the key is not an index into the table. Instead, the (hash & mask) is the index into the table. Mask is the table-size - 1.
- Open addressing is used to deal with collision. Linear probing, the simplest open addressing technique, is useless because of how a hash is converted to index (see above). Instead, the bits of the hash are used to perturb until an empty slot is found.
- The table is resized when it becomes 2/3 full. The resize factor is chosen based on the number of keys. If less than 50,000 keys, the factor is 4, else 2.
- A lot of the dictionaries used internally by Python have string keys. To optimize for these, a dictionary always begins its life with a string-only lookup function. Only on the first non-string lookup or insertion, the lookup function switched permanently to a slower generic lookup function.
- Most dictionaries tend to be small. To optimize for this, a 8-slot hash table is an integral part of the dictionary object. This small table can fit inside two 64-byte cache lines. A larger table is allocated and it is used instead of the small table when the number of keys increases beyond 5.
- Internally, Python maintains an array of dictionaries that were used and discarded earlier. These are reused whenever possible, instead of creating new dictionaries from scratch.
- I looked up some of the above data structures and functions in CPython source code. Much like Python itself, its implementation is refreshingly simple, easy to read and understand. The relevant files are:
Include/dictobject.h
, Objects/dictobject.c
and Objects/dictnotes.txt
.
I hope you have fun spelunking the dictionary implementation in Python. 😊