A hash function h(k) maps keys from a universe U to a fixed-size set of integers {0, 1, ..., m-1}, where m is the table size.
A well-designed hash table answers lookups in O(1) or near-constant time.

h(k) = k mod m
The simplest hash function. The key k is divided by the table size m, and the remainder is the hash value.
- k mod 2^p uses only the lowest p bits, ignoring high-order information
- m not close to a power of 2 gives good distribution
- m = 2003 (prime) is better than m = 2048

Keys: 56, 72, 129, 183, 245
m = 7
h(56) = 56 mod 7 = 0
h(72) = 72 mod 7 = 2
h(129) = 129 mod 7 = 3
h(183) = 183 mod 7 = 1
h(245) = 245 mod 7 = 0 ← collision with 56
Collision: keys 56 and 245 both map to slot 0. Every hash function must have a strategy for handling this.
h(k) = ⌊m × (k × A mod 1)⌋
Where A is a constant in (0, 1). Knuth recommends A ≈ (√5 − 1) / 2 ≈ 0.6180339887.
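A minimal sketch of the multiplication method in Python (the function name and table size are illustrative):

```python
import math

def mult_hash(k: int, m: int, A: float = (math.sqrt(5) - 1) / 2) -> int:
    """Multiplication method: h(k) = floor(m * (k*A mod 1))."""
    frac = (k * A) % 1.0   # fractional part of k*A
    return int(m * frac)   # scale to the table size

# m need not be prime for this method
print([mult_hash(k, 8) for k in [56, 72, 129, 183, 245]])  # → [4, 3, 5, 0, 3]
```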
Unlike the division method, the multiplication method works for any m — no need for primes.

Choose h randomly from a family of hash functions at runtime. A family H is universal if for any two distinct keys x ≠ y:
Pr[h(x) = h(y)] ≤ 1/m
ha,b(k) = ((a × k + b) mod p) mod m
Where p is a prime larger than the universe, a ∈ {1, ..., p-1}, b ∈ {0, ..., p-1}.
Key insight: Universal hashing guarantees O(1) expected-time lookups regardless of the input distribution — it defeats adversarial inputs.
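A sketch of drawing h_{a,b} from this family at runtime, using the Mersenne prime p = 2³¹ − 1 (this assumes keys are smaller than p):

```python
import random

def make_universal_hash(m: int, p: int = 2_147_483_647):
    """Draw h_{a,b}(k) = ((a*k + b) mod p) mod m at random.
    p is a prime assumed larger than the key universe."""
    a = random.randrange(1, p)   # a ∈ {1, ..., p-1}
    b = random.randrange(0, p)   # b ∈ {0, ..., p-1}
    return lambda k: ((a * k + b) % p) % m

h = make_universal_hash(m=7)
print([h(k) for k in [56, 72, 129, 183, 245]])  # varies per run
```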
Each table slot holds a pointer to a linked list. All keys that hash to the same index are stored in that list.
| Operation | Average | Worst |
|---|---|---|
| Search | O(1 + α) | O(n) |
| Insert | O(1) | O(1) |
| Delete | O(1 + α) | O(n) |
α = n/m is the load factor.
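A minimal chaining table might look like this (class and method names are illustrative; Python's built-in hash stands in for h):

```python
class ChainedHashTable:
    """Separate chaining: each slot holds a list of (key, value) pairs."""
    def __init__(self, m=7):
        self.m = m
        self.slots = [[] for _ in range(m)]

    def _bucket(self, key):
        return self.slots[hash(key) % self.m]

    def insert(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # update existing key in place
                bucket[i] = (key, value)
                return
        bucket.append((key, value))      # O(1) append to the chain

    def search(self, key):
        for k, v in self._bucket(key):   # O(1 + α) expected chain walk
            if k == key:
                return v
        return None

t = ChainedHashTable()
t.insert(56, "a"); t.insert(245, "b")    # both hash to slot 0: chained
print(t.search(245))  # → b
```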
All elements stored directly in the table array. On collision, probe the next slot in sequence until an empty slot is found.
h(k, i) = (h'(k) + i) mod m
Contiguous blocks of occupied slots grow and merge, increasing average probe lengths. A cluster of size k has probability (k+1)/m of growing on the next insert.
Why still popular? Linear probing is extremely cache-friendly — sequential memory access dominates performance. With a good hash function and α < 0.7, it outperforms chaining in practice.
Insert 56, 72, 129, 183, 245
m = 7, h(k) = k mod 7
h(56) = 0 → slot 0 ✓
h(72) = 2 → slot 2 ✓
h(129) = 3 → slot 3 ✓
h(183) = 1 → slot 1 ✓
h(245) = 0 → slot 0 full
slot 1 full
slot 2 full
slot 3 full
slot 4 ✓
Key 245 required 4 probes due to the cluster at slots 0–3.
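The trace above can be reproduced with a short sketch (the helper name is illustrative):

```python
def linear_probe_insert(table, key):
    """Linear probing with h(k, i) = (k mod m + i) mod m."""
    m = len(table)
    for i in range(m):
        idx = (key % m + i) % m
        if table[idx] is None:   # first empty slot wins
            table[idx] = key
            return idx
    raise RuntimeError("table full")

table = [None] * 7
slots = [linear_probe_insert(table, k) for k in [56, 72, 129, 183, 245]]
print(slots)  # → [0, 2, 3, 1, 4]: 245 walks the cluster at 0-3 into slot 4
```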
h(k, i) = (h'(k) + c₁·i + c₂·i²) mod m
Common choice: c₁ = 0, c₂ = 1, giving h(k, i) = (h'(k) + i²) mod m.
Quadratic probing does not guarantee visiting all slots. With m prime, the first ⌈m/2⌉ probe positions are distinct, so an insert always succeeds while α < 1/2; with m a power of 2 and c₁ = c₂ = 1/2, the sequence visits every slot.
Keys that hash to the same initial slot follow the same probe sequence. This is secondary clustering — less severe than primary clustering but still suboptimal.
In practice: secondary clustering is rarely a bottleneck. Quadratic probing is a solid middle ground between linear probing's cache friendliness and double hashing's uniform distribution.
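A quick check of the offsets i² mod m for m = 7 shows why only about half the slots are reachable from a given home slot:

```python
m = 7  # prime table size
offsets = [(i * i) % m for i in range(m)]
print(offsets)               # → [0, 1, 4, 2, 2, 4, 1]
print(sorted(set(offsets)))  # → [0, 1, 2, 4] — only ceil(m/2) = 4 distinct slots
```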
h(k, i) = (h₁(k) + i × h₂(k)) mod m
Two independent hash functions. Each key gets its own step size, so different keys that collide on h₁ almost certainly diverge on subsequent probes.
Requirements for h₂:

- h₂(k) ≠ 0 for all k — otherwise the probe sequence never advances
- h₂(k) must be coprime to m — ensures the full table is probed
- Common choice: h₂(k) = 1 + (k mod (m − 1)) with m prime

| Method | Expected probes (miss) | Clustering |
|---|---|---|
| Linear probing | ½(1 + 1/(1−α)²) | Primary |
| Quadratic probing | −ln(1−α) / α | Secondary |
| Double hashing | 1/(1−α) | None |
Trade-off: Double hashing gives the closest approximation to uniform probing (the theoretical ideal). The cost is two hash computations per probe instead of one.
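A sketch of the probe sequences for the colliding keys from the earlier example, using h₂(k) = 1 + (k mod (m − 1)):

```python
def double_hash_probe(k, m=7):
    """Probe sequence h(k, i) = (h1(k) + i*h2(k)) mod m.
    h2 is nonzero and, with m prime, coprime to m."""
    h1 = k % m
    h2 = 1 + (k % (m - 1))
    return [(h1 + i * h2) % m for i in range(m)]

# 56 and 245 collide on h1 (both 0) but diverge immediately:
print(double_hash_probe(56))   # → [0, 3, 6, 2, 5, 1, 4]  (h2 = 3)
print(double_hash_probe(245))  # → [0, 6, 5, 4, 3, 2, 1]  (h2 = 6)
```

Both sequences are full permutations of the table, so every slot is reachable.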
| α range | Behaviour |
|---|---|
| 0.0 – 0.5 | Fast lookups, wasted memory |
| 0.5 – 0.7 | Good balance for open addressing |
| 0.7 – 0.9 | Probe lengths grow; chaining still OK |
| > 1.0 | Only chaining; degrades linearly |
Amortised cost: Each insert is O(1) amortised. A rehash copies all n elements (O(n)), but happens only after Θ(n) insertions.
When α exceeds a threshold, allocate a new table (typically 2m or next prime > 2m), then re-insert every element.
Rehash triggered: α > 0.75
1. Allocate new table of size 2m
2. For each entry in old table:
   compute new_index = hash(key) mod 2m   (rehash from the key, not the old index)
insert into new table
3. Free old table
Most implementations also shrink the table when α drops below ~0.25 to reclaim memory.
A hash function that maps n keys into a table with zero collisions. Works when the key set is known in advance (static dictionary).
Two-level hashing with O(n) space and O(1) worst-case lookup.
Level 1: h₁ maps n keys into m = n slots
Some slots have collisions (sⱼ keys)
Level 2: For each bucket j with sⱼ keys,
use secondary table of size sⱼ²
Choose h₂ⱼ from universal family
→ guaranteed collision-free
Total space across all secondary tables is O(n) in expectation when m = n at level 1. The sⱼ² sizes sound wasteful, but the sum of squares stays linear because the level-1 hash distributes keys well.
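A toy two-level construction along these lines (the retry loop and helper names are illustrative; level-2 hashes are drawn from the universal family ((a·k + b) mod p) mod m):

```python
import random

def build_fks(keys, p=2_147_483_647):
    """Two-level (FKS) perfect hashing sketch: n level-1 buckets,
    each bucket of size s gets a collision-free table of size s^2."""
    n = len(keys)
    def draw(m):
        a, b = random.randrange(1, p), random.randrange(p)
        return lambda k: ((a * k + b) % p) % m
    h1 = draw(n)
    buckets = [[] for _ in range(n)]
    for k in keys:
        buckets[h1(k)].append(k)
    level2 = []
    for bucket in buckets:
        size = max(1, len(bucket) ** 2)
        while True:                       # retry until collision-free
            h2 = draw(size)
            slots = [None] * size
            ok = True
            for k in bucket:
                i = h2(k)
                if slots[i] is not None:  # collision: redraw h2
                    ok = False
                    break
                slots[i] = k
            if ok:
                level2.append((h2, slots))
                break
    def contains(k):                      # O(1) worst case: two hashes
        h2, slots = level2[h1(k)]
        return slots[h2(k)] == k
    return contains

contains = build_fks([56, 72, 129, 183, 245])
print(contains(129), contains(100))  # → True False
```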
A perfect hash function where the table size equals exactly n — every slot is occupied, zero waste, zero collisions. h: S → {0, 1, ..., n-1} is a bijection.
| Algorithm | Build | Bits/key | Lookup |
|---|---|---|---|
| CHD | O(n) | ~2.07 | O(1) |
| BBHash | O(n) | ~3.0 | O(1) |
| RecSplit | O(n) | ~1.56 | O(1) |
| PTHash | O(n) | ~2.2 | O(1) |
Theoretical lower bound: ~1.44 bits per key. Modern algorithms approach this limit.
Static key-value stores — database index for immutable data
Language model vocabularies — map tokens to embedding indices
Genome indexing — map k-mers to positions in a reference
Content-addressable storage — deduplicate by content hash
Standard hash(key) mod n breaks when n (number of servers) changes — nearly all keys remap, causing a rehashing storm.
Map both keys and servers onto a circular hash space [0, 2³²). Each key is assigned to the next server clockwise on the ring.
When server S₄ joins, only the keys between S₃ and S₄ need to move. All other mappings remain unchanged.
When a server joins or leaves, only O(K/n) of the K keys move on average, versus O(K) keys with naive modulo hashing. Karger et al. (1997) — backbone of Memcached, DynamoDB, Cassandra, Akamai CDN.
With few physical nodes, ring partitions are uneven. One server may own 60% of the key space while another owns 5%.
Each physical server is represented by v virtual nodes spread across the ring. More vnodes = more uniform distribution.
Physical S₁ → S₁_0, S₁_1, S₁_2, ...
Physical S₂ → S₂_0, S₂_1, S₂_2, ...
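A compact ring with virtual nodes might be sketched as follows (the SHA-256-derived ring positions and v = 150 are illustrative choices):

```python
import bisect
import hashlib

class HashRing:
    """Consistent hash ring with v virtual nodes per physical server."""
    def __init__(self, servers, v=150):
        # Each server owns v points on the ring, e.g. S1_0, S1_1, ...
        self.ring = sorted(
            (self._pos(f"{s}_{i}"), s) for s in servers for i in range(v)
        )
        self.points = [p for p, _ in self.ring]

    @staticmethod
    def _pos(label):
        # Map a label into the 32-bit circular hash space
        return int.from_bytes(hashlib.sha256(label.encode()).digest()[:4], "big")

    def server_for(self, key):
        # First virtual node clockwise from the key (wrapping around)
        i = bisect.bisect(self.points, self._pos(key)) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["S1", "S2", "S3"])
print(ring.server_for("user:42"))  # stable while membership is stable
```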
With v = 150 virtual nodes per server, the standard deviation of load drops below 5%.

| System | Ring mechanism |
|---|---|
| DynamoDB | Consistent hashing with vnodes & preference lists |
| Cassandra | Token ring, default 256 vnodes per node |
| Riak | Ring of 64/128/256 partitions with ownership transfer |
| Akka Cluster | Consistent hash router with vnodes |
| Property | Description |
|---|---|
| Pre-image resistance | Given h, infeasible to find m where H(m) = h |
| Second pre-image | Given m₁, infeasible to find m₂ ≠ m₁ with same hash |
| Collision resistance | Infeasible to find any m₁ ≠ m₂ with same hash |
| Avalanche effect | Flipping one input bit changes ~50% of output bits |
| Fixed output size | Arbitrary-length input → fixed-length digest |
| Algorithm | Output | Status |
|---|---|---|
| MD5 | 128 bits | Broken |
| SHA-1 | 160 bits | Deprecated |
| SHA-256 | 256 bits | Current standard |
| SHA-3 | 224–512 bits | NIST standard (Keccak) |
| BLAKE3 | 256 bits | Fastest modern hash |
Merkle–Damgård construction. Processes input in 512-bit blocks through 64 rounds of mixing.
SHA-256("hello") =
2cf24dba5fb0a30e26e83b2ac5b9e29e
1b161e5c1fa7425e7304362938b62494
Tree-based Merkle construction. Unlimited parallelism across chunks.
| Use case | Choice |
|---|---|
| Standards compliance | SHA-256 |
| Max throughput | BLAKE3 |
| Password storage | Neither — use Argon2/bcrypt |
| Implementation | Details |
|---|---|
| Python dict | Resizes at α > 2/3; power-of-2 sizes |
| Java HashMap | Default load factor 0.75; power-of-2 capacity; relies on hashCode() and equals() |
| C++ std::unordered_map | max_load_factor is 1.0 by default |
| Rust HashMap (SwissTable) | Resizes at α > 7/8; extremely cache-efficient |

Two hash functions h₁, h₂ and two tables T₁, T₂. Each key lives in exactly one of its two possible positions.
def insert(key, max_kicks=32):
    for _ in range(max_kicks):
        if T1[h1(key)] is None:
            T1[h1(key)] = key
            return
        if T2[h2(key)] is None:
            T2[h2(key)] = key
            return
        # Evict the occupant of T1[h1(key)] and carry it onward;
        # next iteration tries the displaced key's alternate position
        key, T1[h1(key)] = T1[h1(key)], key
    rehash()  # probable cycle: rebuild with fresh h1, h2
| Property | Value |
|---|---|
| Lookup | O(1) worst case — check 2 locations |
| Insert | O(1) amortised |
| Space | ~50% utilisation (α ≈ 0.5) |
| Delete | O(1) — no tombstones |
Worst-case O(1) lookup — no long probe chains, no degenerate linked lists. Used in network switches, GPU hash tables, and high-frequency trading.
Open addressing with linear probing, but elements are reordered during insertion to equalise probe distances. "Rob from the rich, give to the poor."
If the new key's probe distance exceeds the current occupant's, swap them and continue inserting the displaced element.
def insert(key):
    dist, idx = 0, h(key)
    while table[idx] is not None:
        if dist > probe_dist(table[idx], idx):
            # Rob from the rich: displace the occupant that is closer to home
            key, table[idx] = table[idx], key
            dist = probe_dist(key, idx)  # continue with the displaced key
        idx = (idx + 1) % m
        dist += 1
    table[idx] = key
Variance reduction — max probe distance is O(log n) with high probability
Cache-friendly — linear probing with bounded displacement
Early termination — stop search when current element's probe distance exceeds search distance
In practice: Robin Hood hashing was the basis of Rust's original HashMap (pre-SwissTable) and many game engine hash maps. Its bounded probe distance makes worst-case performance predictable.
Standard hashes make similar inputs diverge. LSH does the opposite: similar items hash to the same bucket with high probability.
| Distance | LSH family | Use |
|---|---|---|
| Jaccard | MinHash | Document dedup |
| Cosine | SimHash | Semantic similarity |
| Euclidean | Random proj. | Image search |
| Hamming | Bit sampling | Fingerprint match |
Trade-off: LSH trades exact correctness for massive speed gains. Brute-force nearest-neighbour search over n points in d dimensions costs O(nd) per query; LSH reduces this to sub-linear time with tuneable accuracy.
A family H is (d₁, d₂, p₁, p₂)-sensitive if close points (dist ≤ d₁) collide with probability ≥ p₁ and far points (dist ≥ d₂) collide with probability ≤ p₂.
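A toy MinHash sketch for Jaccard similarity (the number of hash functions and the modulus are illustrative; the estimate tightens as more hashes are used):

```python
import random

def minhash_sig(tokens, num_hashes=64, seed=1):
    """Signature of per-hash minima over the token set. The fraction of
    matching positions estimates the Jaccard similarity."""
    rng = random.Random(seed)
    params = [(rng.randrange(1, 2**31), rng.randrange(2**31))
              for _ in range(num_hashes)]
    return [min((a * hash(t) + b) % 2_147_483_647 for t in tokens)
            for a, b in params]

def est_jaccard(s1, s2):
    return sum(x == y for x, y in zip(s1, s2)) / len(s1)

d1 = {"the", "quick", "brown", "fox"}
d2 = {"the", "quick", "brown", "dog"}
sig1, sig2 = minhash_sig(d1), minhash_sig(d2)
print(est_jaccard(sig1, sig2))  # noisy estimate of true Jaccard 3/5 = 0.6
```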
Hash tables power in-memory caches (Memcached, Redis). Key lookup in O(1) enables microsecond response times. Consistent hashing distributes keys across nodes.
Content-addressable storage hashes each data block. If the hash exists, the block is a duplicate — store a reference instead. Used in Borg, Restic, ZFS, and build caches.
Detect corruption in data transfer and storage. CRC32 for fast error detection; SHA-256 for cryptographic integrity verification.
Space-efficient probabilistic structure using k hash functions to set bits in a bit array. "Possibly in the set" or "definitely not" — false positives, no false negatives.
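A minimal Bloom filter sketch (the array size m, hash count k, and the SHA-256-based index derivation are illustrative choices; one byte per bit is used for simplicity):

```python
import hashlib

class BloomFilter:
    """Bloom filter: k hash functions set positions in an m-slot array."""
    def __init__(self, m=1024, k=4):
        self.m, self.k = m, k
        self.bits = bytearray(m)

    def _indexes(self, item):
        # Derive k indexes by salting the item with the hash number
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for i in self._indexes(item):
            self.bits[i] = 1

    def __contains__(self, item):
        # All k bits set → "possibly in set"; any unset → "definitely not"
        return all(self.bits[i] for i in self._indexes(item))

bf = BloomFilter()
bf.add("alice"); bf.add("bob")
print("alice" in bf)  # → True — members are always reported present
```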
Key takeaways: the division method is sensitive to the choice of m; the multiplication method is more robust; well-tuned tables keep O(1) lookup.

| Source | Description |
|---|---|
| CLRS | Introduction to Algorithms — Ch. 11, the canonical hash tables reference |
| Knuth | TAOCP Vol. 3 — Sorting and Searching, Section 6.4 |
| Karger et al. | "Consistent Hashing and Random Trees" (1997) |
| Mitzenmacher | Probability and Computing — hashing and Bloom filter analysis |
| Swiss Tables | CppCon 2017 — Matt Kulukundis on Abseil flat_hash_map |