A dictionary (also called a map or associative array) supports three core operations on key-value pairs:
INSERT(k, v) — store value v under key k
SEARCH(k) — return the value stored with key k, or report it absent
DELETE(k) — remove key k and its value
Hash tables achieve expected constant time by trading deterministic guarantees for expected-case performance. The key insight: convert the key to an array index via a hash function.
| Data Structure | INSERT | SEARCH | DELETE |
|---|---|---|---|
| Unsorted array | O(1) | O(n) | O(n) |
| Sorted array | O(n) | O(log n) | O(n) |
| Balanced BST | O(log n) | O(log n) | O(log n) |
| Hash table | O(1) exp. | O(1) exp. | O(1) exp. |
If every possible key is in {0, 1, ..., U-1} and U is small, store values in a plain array T[0..U-1].
T[k] = v // INSERT
T[k] // SEARCH
T[k] = NIL // DELETE
All operations are O(1) worst case.
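A minimal sketch of a direct-address table in Python, assuming integer keys drawn from a small universe (the names `insert`, `search`, `delete` are illustrative, not a standard API):

```python
# Direct-address table sketch: keys come from a small universe {0, ..., U-1}.
# Every operation is a single array access — O(1) worst case.
U = 16
T = [None] * U          # one slot per possible key

def insert(k, v): T[k] = v
def search(k):    return T[k]   # None signals "absent"
def delete(k):    T[k] = None

insert(3, "three")
assert search(3) == "three"
delete(3)
assert search(3) is None
```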
Direct addressing is the conceptual foundation. A hash table generalises it by compressing the key space via a hash function.
Direct addressing breaks down when:
U is large (e.g. 64-bit integers, strings) — allocating an array of size U is infeasible
only n << U keys are ever used — most of the array is wasted
A hash table compresses the key space into [0, m-1] where m = O(n). A hash function h(k) maps a key from universe U to an index in {0, 1, ..., m-1}.
A good hash function is cheap to compute — O(1), or O(|key|) for variable-length keys — and spreads keys uniformly over the slots.
Division method:
h(k) = k mod m
Simple but sensitive to the choice of m. Avoid powers of 2 — they use only the low-order bits of the key. Choose m as a prime not close to a power of 2.
h(k) = floor(m * (k * A mod 1))
// where 0 < A < 1
// Knuth: A = (sqrt(5) - 1) / 2 ≈ 0.618
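A sketch of the multiplication method in Python, using Knuth's suggested constant (the function name `h_mul` is illustrative):

```python
import math

# Multiplication method: h(k) = floor(m * frac(k * A)), 0 < A < 1.
A = (math.sqrt(5) - 1) / 2      # Knuth's constant, ≈ 0.618

def h_mul(k, m):
    frac = (k * A) % 1.0        # fractional part of k * A
    return int(m * frac)        # scale up to a slot index in [0, m-1]

m = 8
for k in range(100):
    assert 0 <= h_mul(k, m) < m
```

Unlike the division method, the choice of m is not critical here, which is why m can even be a power of 2.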
| Function | Use case |
|---|---|
| MurmurHash3 | General-purpose, fast, good distribution |
| xxHash | Extremely fast, widely used in checksums |
| SipHash | Hash-flooding resistant — default in Python, Rust |
| FNV-1a | Simple, byte-at-a-time, good for short keys |
| CityHash / FarmHash | Google's fast hashes for strings |
Cryptographic hashes (SHA-256) provide stronger guarantees but are much slower. Use them for security, not hash tables.
Any single deterministic hash function has a worst case: an adversary can craft n keys that all hash to the same slot, degrading every operation to O(n).
A family H is universal if for any two distinct keys x ≠ y:
Pr[h(x) = h(y)] ≤ 1/m
// where h chosen randomly from H
For prime p ≥ |U|:
h_{a,b}(k) = ((a * k + b) mod p) mod m
// a ∈ {1, ..., p-1}, b ∈ {0, ..., p-1}
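A sketch of drawing one function from this universal family, assuming integer keys below the chosen prime p (the constants here are illustrative):

```python
import random

# Universal family: h_{a,b}(k) = ((a*k + b) mod p) mod m, with p prime >= |U|.
# a and b are drawn at random ONCE, at construction — no per-query randomness.
p = 2_147_483_647               # Mersenne prime 2^31 - 1, assumed >= |U|
m = 128

a = random.randrange(1, p)      # a ∈ {1, ..., p-1}
b = random.randrange(0, p)      # b ∈ {0, ..., p-1}

def h(k):
    return ((a * k + b) % p) % m

for k in range(1000):
    assert 0 <= h(k) < m
```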
Expected chain length is n/m regardless of key distribution
No adversary can force worst-case behaviour without knowing a, b
Choosing a new random h at construction time is sufficient — no per-query randomness needed
With universal hashing, every operation runs in O(1) expected time.
Separate chaining: each slot T[i] points to a linked list of all elements whose keys hash to i.
INSERT: compute h(k), prepend to list at T[h(k)] — O(1)
SEARCH: compute h(k), traverse list at T[h(k)] — O(chain length)
DELETE: O(chain length) to find the node, O(1) to unlink
With load factor α = n/m and a good hash function:
expected chain length is α, so SEARCH costs O(1 + α)
if n = O(m), all operations are O(1) expected
Chaining is simple and tolerates α > 1 gracefully. Downside: pointer chasing is cache-unfriendly.
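A minimal chained table sketch in Python, using a list per slot as the chain (the class name and method names are illustrative):

```python
# Separate-chaining sketch: each slot holds a list of (key, value) pairs.
class ChainedTable:
    def __init__(self, m=8):
        self.m = m
        self.slots = [[] for _ in range(m)]

    def insert(self, k, v):
        chain = self.slots[hash(k) % self.m]
        for i, (ck, _) in enumerate(chain):
            if ck == k:
                chain[i] = (k, v)          # overwrite an existing key
                return
        chain.append((k, v))               # O(1) append

    def search(self, k):
        for ck, cv in self.slots[hash(k) % self.m]:
            if ck == k:
                return cv
        return None                        # absent

    def delete(self, k):
        i = hash(k) % self.m
        self.slots[i] = [(ck, cv) for ck, cv in self.slots[i] if ck != k]

t = ChainedTable()
t.insert("x", 1); t.insert("y", 2)
assert t.search("x") == 1
t.delete("x")
assert t.search("x") is None
```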
All elements stored directly in the table array. No pointers, no linked lists. On collision, probe a sequence of alternative slots until an empty one is found.
h(k, i) = (h'(k) + i) mod m
// for i = 0, 1, 2, ...
Slots checked sequentially. Cache-friendly but suffers from primary clustering — runs of occupied slots grow and merge.
h(k, i) = (h'(k) + c1*i + c2*i²) mod m
Spreads probes more than linear. Avoids primary clustering but can cause secondary clustering.
| Method | Cache | Clustering | Coverage |
|---|---|---|---|
| Linear | Excellent | Primary | All slots |
| Quadratic | Good | Secondary | May miss if m not prime |
Linear probing is used in practice more than theory suggests — modern CPUs love sequential memory access. Swiss Table proves this decisively.
Primary clustering — occupied runs grow: [diagram] a run of 5 occupied slots; new inserts landing anywhere in the run extend it further.
h(k, i) = (h1(k) + i * h2(k)) mod m
// h2(k) must never return 0
// Common: h2(k) = 1 + (k mod (m - 1)), m prime
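A sketch of the double-hashing probe sequence in Python, using the common h2 above with a prime m (names are illustrative):

```python
# Double hashing: h(k, i) = (h1(k) + i*h2(k)) mod m.
# With m prime and h2(k) in {1, ..., m-1}, the sequence visits every slot.
m = 13                                   # prime table size

def h1(k): return k % m
def h2(k): return 1 + (k % (m - 1))      # never 0, so probes always advance

def probe_sequence(k):
    return [(h1(k) + i * h2(k)) % m for i in range(m)]

seq = probe_sequence(42)
assert sorted(seq) == list(range(m))     # all m slots covered exactly once
```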
Expected probe counts under uniform hashing:
unsuccessful search — 1 / (1 - α)
successful search — (1/α) * ln(1 / (1 - α))
As α → 1, performance degrades sharply. Open-addressing tables should resize well before reaching α = 1 — typically at α ∈ [0.5, 0.75].
| Load α | Unsuccessful | Successful |
|---|---|---|
| 0.50 | 2.0 probes | 1.4 probes |
| 0.75 | 4.0 probes | 1.8 probes |
| 0.90 | 10.0 probes | 2.6 probes |
| 0.95 | 20.0 probes | 3.2 probes |
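The table entries follow directly from the two uniform-hashing formulas; a quick check in Python:

```python
import math

# Expected probe counts under uniform hashing.
def unsuccessful(alpha): return 1 / (1 - alpha)
def successful(alpha):   return (1 / alpha) * math.log(1 / (1 - alpha))

assert round(unsuccessful(0.50), 1) == 2.0
assert round(successful(0.50), 1) == 1.4
assert round(unsuccessful(0.90), 1) == 10.0
assert round(successful(0.95), 1) == 3.2
```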
Linear — fast cache, primary clustering
Quadratic — better spread, secondary clustering
Double — no clustering, two hash computations
α = n / m
// n = stored elements, m = table capacity
Chaining: α can exceed 1.0, and performance degrades linearly — O(1 + α).
Open addressing: α must stay below 1.0, and performance degrades non-linearly.
Maintain α ≤ c for a constant c < 1 by doubling the table when the threshold is exceeded:
normal insert — O(1) expected
resize — O(n), but it happens only every O(n) insertions, so inserts stay O(1) amortised
| Load target | Memory overhead | Probe perf. |
|---|---|---|
| α ≤ 0.50 | 2x actual data | Excellent |
| α ≤ 0.75 | 1.33x actual data | Good |
| α ≤ 0.90 | 1.11x actual data | Acceptable / poor |
Most real-world hash tables target α ∈ [0.5, 0.75]. Python's dict resizes at α = 2/3.
For n keys in m slots with universal hashing, the expected longest chain is Θ(log n / log log n).
Typically when α exceeds a threshold (e.g. 0.75). Some implementations also shrink when α drops below a lower bound (e.g. 0.25).
Allocate a new table of size m' = 2m (or the next prime > 2m), then for each element compute h(k) mod m' and insert it into the new table. Cost: O(n) — every element must be rehashed because the modulus changed.
| Strategy | New size | Amortised |
|---|---|---|
| Doubling | 2m | O(1) |
| 1.5x growth | 1.5m | O(1) (lower const.) |
| Prime stepping | next prime > 2m | O(1), better distribution |
Shrinking is often omitted in practice — the memory cost of an oversized table is usually acceptable, and shrink-grow oscillation near the threshold is expensive.
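A sketch of table doubling on a chained table in Python (class and threshold names are illustrative; the growth trigger mirrors the α ≤ 0.75 policy discussed above):

```python
# Table-doubling sketch: once n/m crosses the threshold, rehash every
# element into a table of size 2m.
class GrowableTable:
    MAX_LOAD = 0.75

    def __init__(self):
        self.m, self.n = 8, 0
        self.slots = [[] for _ in range(self.m)]

    def insert(self, k, v):
        if self.n / self.m > self.MAX_LOAD:
            self._grow()
        self.slots[hash(k) % self.m].append((k, v))
        self.n += 1

    def _grow(self):
        old = self.slots
        self.m *= 2                            # new modulus
        self.slots = [[] for _ in range(self.m)]
        for chain in old:                      # O(n): every key rehashed
            for k, v in chain:
                self.slots[hash(k) % self.m].append((k, v))

    def search(self, k):
        for ck, cv in self.slots[hash(k) % self.m]:
            if ck == k:
                return cv
        return None

t = GrowableTable()
for i in range(100):
    t.insert(i, i * i)
assert t.search(37) == 37 * 37
assert t.m > 8                                 # at least one doubling happened
```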
A variant of open-addressing (typically linear probing) where the probe distance of each element is tracked.
During insertion, if the new element's probe distance exceeds the occupant's, swap them — "steal from the rich, give to the poor."
Insert key K with probe distance d:
While slot is occupied:
If d > occupant's probe distance:
Swap K with occupant
Continue inserting displaced element
Advance to next slot, d++
Place K in empty slot
Reduced variance — all elements have similar probe lengths
Expected max probe — O(log log n) vs O(log n) for standard linear probing
Early termination — lookup stops if current probe distance exceeds occupant's
Simpler deletion — backward-shift instead of tombstones
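The insertion pseudocode above can be sketched as runnable Python, assuming small integer keys so a key's home slot is recoverable via `hash` (the helper name `rh_insert` is illustrative, and the sketch assumes a free slot exists):

```python
# Robin Hood insertion sketch (linear probing): at each occupied slot,
# compare probe distances and swap if the element in hand is "poorer".
def rh_insert(table, m, key):
    d = 0                                      # probe distance of element in hand
    i = hash(key) % m
    while table[i] is not None:
        occupant = table[i]
        occ_d = (i - hash(occupant) % m) % m   # occupant's probe distance
        if d > occ_d:                          # steal from the rich
            table[i], key = key, occupant
            d = occ_d                          # continue inserting the evictee
        i = (i + 1) % m                        # advance to next slot
        d += 1
    table[i] = key                             # place in the empty slot

m = 8
table = [None] * m
for k in [1, 9, 17, 2]:                        # 1, 9, 17 all hash to slot 1 (mod 8)
    rh_insert(table, m, k)
assert sorted(x for x in table if x is not None) == [1, 2, 9, 17]
```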
Robin Hood hashing makes linear probing practical at higher load factors. Rust's HashMap used Robin Hood hashing before switching to hashbrown (Swiss Table).
Uses two hash functions h1 and h2 and two tables T1, T2.
SEARCH probes T1[h1(k)] and T2[h2(k)] — exactly 2 lookups, always — so lookup is O(1) worst case.
Insert key k:
If T1[h1(k)] empty -> place k, done
Evict x from T1[h1(k)], place k
If T2[h2(x)] empty -> place x, done
Evict y from T2[h2(x)], place x
... repeat (max iteration limit)
If cycle -> rehash with new functions
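The eviction loop above can be sketched in Python with two toy hash functions (h1, h2 here are illustrative; a real table would use strong, independent hashes):

```python
# Cuckoo insertion sketch: evict on collision and bounce between T1 and T2
# up to MAX_KICKS times; hitting the limit signals a probable cycle.
M, MAX_KICKS = 11, 32

def h1(k): return k % M
def h2(k): return (k // M) % M

T1, T2 = [None] * M, [None] * M

def insert(k):
    for _ in range(MAX_KICKS):
        i = h1(k)
        k, T1[i] = T1[i], k            # place k, take the evictee in hand
        if k is None:
            return
        j = h2(k)
        k, T2[j] = T2[j], k            # place the evictee, maybe evict again
        if k is None:
            return
    raise RuntimeError("probable cycle — rehash with new hash functions")

def search(k):
    return T1[h1(k)] == k or T2[h2(k)] == k   # exactly 2 probes, worst case

for k in [5, 16, 27]:                  # all collide at T1[5] (k mod 11 == 5)
    insert(k)
assert all(search(k) for k in [5, 16, 27])
```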
| Property | Value |
|---|---|
| Lookup | O(1) worst case |
| Deletion | O(1) worst case |
| Insertion | O(1) amortised (with rehash) |
| Max load factor | ~50% (two tables) or higher with buckets |
| Rehash trigger | Insertion cycle detected |
Cuckoo hashing guarantees constant-time lookups — ideal for network routers, hardware implementations, and read-heavy workloads.
Each slot has a "neighbourhood" of H consecutive slots (typically H = 32). An element hashing to slot i must reside within i to i + H - 1.
Check all H slots in the neighbourhood — a single cache line or two. Uses a bitmap per slot to record which neighbours are occupied by elements hashing to that slot.
Bounded worst-case probe — at most H slots
Cache-friendly — neighbourhood fits in 1–2 cache lines
Concurrency-friendly — lock only the neighbourhood, not the whole table
No tombstones — just clear the bitmap bit and the slot
Hopscotch hashing combines the cache-friendliness of linear probing with a bounded probe guarantee. Used in Cliff Click's lock-free concurrent hash map.
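A simplified lookup/insert sketch in Python showing the neighbourhood bitmap idea (names are illustrative; a real hopscotch table displaces elements to free a neighbourhood slot, which this sketch omits):

```python
# Hopscotch sketch: an element hashing to slot i lives in slots i .. i+H-1;
# the bitmap at slot i records which of those hold elements that hash to i.
H, M = 4, 16
slots   = [None] * M
bitmaps = [0] * M       # bit j set => slot (i+j) % M holds a key hashing to i

def add(key):           # simplified: assumes a free slot in the neighbourhood
    i = key % M
    for j in range(H):
        if slots[(i + j) % M] is None:
            slots[(i + j) % M] = key
            bitmaps[i] |= 1 << j
            return
    raise RuntimeError("neighbourhood full — real hopscotch displaces elements")

def contains(key):
    i = key % M
    for j in range(H):  # at most H probes, guided by the bitmap
        if bitmaps[i] >> j & 1 and slots[(i + j) % M] == key:
            return True
    return False

for k in [3, 19, 35]:   # all hash to slot 3
    add(k)
assert contains(19) and not contains(51)
```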
[diagram] Neighbourhood of slot 3 (H = 4): bitmap 1101 — elements hashing to slot 3 reside at slots 3, 4 and 6.
Open-addressing with linear probing, but with SIMD-accelerated metadata lookups. Used in Abseil (absl::flat_hash_map) and Rust's hashbrown.
1. h1 = hash(key)
2. group = h1 mod num_groups
3. H2 = top 7 bits of hash
4. Load 16 ctrl bytes → SIMD register
5. Compare all 16 vs H2 in ONE instr
(SSE2 _mm_cmpeq_epi8)
6. Bitmask → check only matches
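The group-matching step can be sketched without real SIMD — a scalar Python loop that produces the same bitmask `_mm_cmpeq_epi8` plus a movemask would yield in two instructions (the names and the 0xFF empty value follow this document's convention):

```python
# Group-matching sketch: compare an H2 byte against all 16 control bytes
# of a group and return a bitmask of candidate slots.
EMPTY = 0xFF

def match(ctrl_group, h2):
    mask = 0
    for i, byte in enumerate(ctrl_group):   # SIMD does all 16 at once
        if byte == h2:
            mask |= 1 << i
    return mask

group = [EMPTY] * 16
group[2] = group[9] = 0x5A                  # two slots whose H2 is 0x5A
m = match(group, 0x5A)
assert m == (1 << 2) | (1 << 9)             # only these need a full key compare
```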
| Property | Value |
|---|---|
| Probe unit | 16 slots at once (SIMD) |
| Empty check | Single SIMD comparison |
| Load factor | Up to ~87.5% efficiently |
| Deletion | "deleted" sentinel — no tombstone clustering |
| Memory layout | Keys/values stored flat, no pointers |
Swiss Table is the state of the art for general-purpose hash tables. It proves that linear probing + SIMD beats more complex schemes in practice.
[diagram] One group of 16 control bytes: each byte is either an H2 value to match, FF (empty), or FE (deleted).
A perfect hash function maps n keys to m slots with no collisions. A minimal perfect hash uses m = n.
Level 1: Hash n keys into m primary slots using a universal hash function.
Level 2: For each primary slot j with n_j keys, build a secondary table of size n_j² with its own universal hash function.
Level 1: h1(k) → primary slot j
Level 2: h_j(k) → position in
secondary table for slot j
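The two-level construction can be sketched in Python with a universal family, retrying each secondary hash until it is collision-free (the prime, key set, and helper names are illustrative):

```python
import random

# FKS two-level sketch: secondary tables of size n_j², hash functions
# retried until collision-free; lookups are then 2 hashes, no probing.
p = 10_000_019                              # prime assumed >= |U|
keys = [12, 97, 1024, 55_000, 123_456]
m = len(keys)

def make_hash(size):
    a, b = random.randrange(1, p), random.randrange(p)
    return lambda k: ((a * k + b) % p) % size

h1 = make_hash(m)                           # level 1: universal hash into m slots
buckets = [[] for _ in range(m)]
for k in keys:
    buckets[h1(k)].append(k)

secondary = []
for bucket in buckets:                      # level 2: one table per slot
    size = len(bucket) ** 2 or 1
    while True:                             # expected O(1) retries per bucket
        h = make_hash(size)
        if len({h(k) for k in bucket}) == len(bucket):   # collision-free?
            break
    table = [None] * size
    for k in bucket:
        table[h(k)] = k
    secondary.append((h, table))

def lookup(k):
    h, table = secondary[h1(k)]
    return table[h(k)] == k                 # worst-case O(1): 2 hashes, 2 reads

assert all(lookup(k) for k in keys)
assert not lookup(999)
```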
| Property | Value |
|---|---|
| Lookup | O(1) worst case — 2 hashes, 2 lookups |
| Total space | O(n) expected |
| Construction | O(n) expected |
| Static only | Keys must be known in advance |
Perfect hashing is optimal for static dictionaries — compiler keyword tables, routing tables with fixed prefix sets. For dynamic sets, cuckoo hashing is the closest analogue.
Python dict: open addressing; resizes at α = 2/3, growing ~3x when small and ~2x when large.
Java HashMap: separate chaining; resizes to 2 * capacity at α = 0.75; chains longer than 8 become red-black trees (O(log n) per bucket); hashes are spread with (h = key.hashCode()) ^ (h >>> 16).
| Feature | Python dict | Java HashMap |
|---|---|---|
| Strategy | Open addressing | Separate chaining |
| Order preserved | Yes (insertion order) | No (use LinkedHashMap) |
| Treeification | No | Yes (chains > 8) |
| Hash function | SipHash | hashCode() XOR-shifted |
| Default load factor | 0.67 | 0.75 |
C++ std::unordered_map uses separate chaining with max_load_factor = 1.0 by default, and is notoriously slow:
| Issue | Cause |
|---|---|
| Pointer chasing | Each node is a separate heap allocation |
| Cache misses | Nodes scattered across memory |
| High load factor | Default 1.0 means long chains |
| Iterator stability | Standard requires refs stay valid — prevents open addressing |
For performance-critical C++ code, use absl::flat_hash_map (Swiss Table), robin_map (Tessil), or ankerl::unordered_dense.
Concurrent reads and writes require synchronisation. A global lock serialises all operations and destroys throughput.
Lock striping partitions the buckets into segments, each guarded by its own lock. Java 8 ConcurrentHashMap goes finer: CAS to install into an empty bucket, synchronized on the bucket head otherwise.
| Implementation | Read | Write | Resize |
|---|---|---|---|
| Global lock | Serial | Serial | Blocking |
| Lock striping | Parallel | Parallel | Per-segment |
| CAS per bucket | Lock-free | Lock-free | Cooperative |
| RCU | Wait-free | Lock write | Copy-on-write |
For read-heavy workloads, RCU or lock-free maps offer the best throughput. For mixed workloads, CAS-based per-bucket locking (Java 8 ConcurrentHashMap) is the industry standard.
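A lock-striping sketch in Python's threading module (class name and stripe count are illustrative; each stripe is a plain dict guarded by its own lock):

```python
import threading

# Lock-striping sketch: N locks guard N disjoint stripes of buckets, so
# writers touching different stripes proceed in parallel.
class StripedMap:
    STRIPES = 16

    def __init__(self):
        self.locks = [threading.Lock() for _ in range(self.STRIPES)]
        self.buckets = [dict() for _ in range(self.STRIPES)]

    def _stripe(self, k):
        return hash(k) % self.STRIPES

    def put(self, k, v):
        s = self._stripe(k)
        with self.locks[s]:                 # only this stripe is serialised
            self.buckets[s][k] = v

    def get(self, k):
        s = self._stripe(k)
        with self.locks[s]:
            return self.buckets[s].get(k)

m = StripedMap()
threads = [threading.Thread(target=m.put, args=(i, i * 2)) for i in range(64)]
for t in threads: t.start()
for t in threads: t.join()
assert all(m.get(i) == i * 2 for i in range(64))
```

Resizing a striped map requires acquiring all stripe locks (or resizing per stripe), which is why the table lists resize as "per-segment".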
Symbol tables — compilers, interpreters, debuggers
Caching — memoisation, LRU caches (hash map + doubly-linked list)
Databases — hash indexes, hash joins, in-memory tables
Networking — connection tracking, NAT tables, routing tables
Deduplication — seen-set for crawlers, stream processing
Counting — frequency tables, word counts, histograms
Sets — membership testing (hash set = hash map with no values)
No ordering — no range queries, min/max, successor (use BST / B-tree)
Hash quality matters — poor functions cause clustering and O(n)
Hash-flooding attacks — adversarial inputs force worst case (mitigate with SipHash / random seeds)
Memory overhead — open addressing wastes empty slots; chaining wastes pointers
Resizing latency — rehashing n elements pauses the system (mitigate with incremental rehash)
Not scan-optimal — iterating all elements slower than a flat array
When you need sorted iteration or range queries, use a balanced BST (std::map, TreeMap) or B-tree.
In summary:
hash tables — O(1) expected time for insert, search, delete; the fastest dictionary for point queries
cuckoo hashing — O(1) worst-case lookup
perfect hashing — O(1) worst case for static key sets
| Source | Description |
|---|---|
| CLRS | Introduction to Algorithms, chapters 11–12 |
| Knuth | TAOCP Vol. 3 — Sorting & Searching, Section 6.4 |
| Herlihy & Shavit | The Art of Multiprocessor Programming |
| Abseil | Swiss Table design notes |
| Pagh & Rodler | "Cuckoo Hashing" — J. Algorithms, 2004 |
| Celis et al. | "Robin Hood Hashing" — FOCS 1985 |