COMPUTER SCIENCE FUNDAMENTALS SERIES

Hashing
Algorithms

Hash functions · Collision resolution · Open addressing ·
Consistent hashing · Cryptographic hashes · Perfect hashing
Mid-level software engineer track  ·  20 slides
02

Hash Function Properties

What is a hash function?

A hash function h(k) maps keys from a universe U to a fixed-size set of integers {0, 1, ..., m-1}, where m is the table size.

Desirable properties

  • Deterministic — the same key always produces the same hash value
  • Uniform distribution — keys spread evenly across the output range to minimise collisions
  • Efficiency — computing the hash should be O(1) or near-constant time
  • Avalanche effect — a small change in input produces a large change in output
  • Minimised collisions — different keys should map to different indices as often as possible
[Diagram] Keys from the universe U ("alice", "bob", "carol", "dave") pass through h(k) mod m into slots {0..m-1}:

  0: "bob"   1: —   2: "alice"   3: "carol"   4: —   5: "dave"

When |U| > m, collisions are inevitable (pigeonhole principle).
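The avalanche effect can be checked empirically — a minimal sketch (using SHA-256 as the reference hash) that changes one input character and counts differing output bits:

```python
import hashlib

def bit_diff(a: bytes, b: bytes) -> int:
    # Count differing bits between two equal-length digests
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

d1 = hashlib.sha256(b"alice").digest()
d2 = hashlib.sha256(b"alicf").digest()  # single-character change

# A good hash flips roughly half of the 256 output bits
print(bit_diff(d1, d2))
```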
03

The Division Method

Formula

h(k) = k mod m

The simplest hash function. The key k is divided by the table size m, and the remainder is the hash value.

Choosing m

  • Avoid powers of 2 — k mod 2^p uses only the lowest p bits, ignoring high-order information
  • Avoid powers of 10 — same issue for decimal-encoded keys
  • Prefer primes — a prime m not close to a power of 2 gives good distribution
  • Example: for ~2000 slots, m = 2003 (prime) is better than m = 2048

Example

Keys: 56, 72, 129, 183, 245
m = 7

h(56)  = 56  mod 7 = 0
h(72)  = 72  mod 7 = 2
h(129) = 129 mod 7 = 3
h(183) = 183 mod 7 = 1
h(245) = 245 mod 7 = 0  ← collision with 56

Collision: keys 56 and 245 both map to slot 0. Every hash function must have a strategy for handling this.
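The worked example above as a runnable sketch:

```python
def h(k: int, m: int = 7) -> int:
    # Division method: hash is the remainder of k divided by table size m
    return k % m

keys = [56, 72, 129, 183, 245]
for k in keys:
    print(k, "->", h(k))  # 56 and 245 both land in slot 0
```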

04

Multiplication Method & Universal Hashing

Multiplication method

h(k) = ⌊m × (k × A mod 1)⌋

Where A is a constant in (0, 1). Knuth recommends A ≈ (√5 − 1) / 2 ≈ 0.6180339887.

  • Works well with any table size m — no need for primes
  • Extracts information from all bits of the key
  • Easy to implement with fixed-point arithmetic

Universal hashing

Choose h randomly from a family of hash functions at runtime. A family H is universal if for any two distinct keys x ≠ y:

Pr[h(x) = h(y)] ≤ 1/m

Carter-Wegman family

h_{a,b}(k) = ((a × k + b) mod p) mod m

Where p is a prime larger than the universe, a ∈ {1, ..., p-1}, b ∈ {0, ..., p-1}.

Key insight: Universal hashing guarantees O(1) expected-time lookups regardless of the input distribution — it defeats adversarial inputs.
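A sketch of drawing a random member of the Carter-Wegman family (the Mersenne prime 2³¹ − 1 stands in for p here, assuming keys fit below it):

```python
import random

def make_universal_hash(m: int, p: int = 2_147_483_647):
    # Random a, b fix one member of the family h_{a,b}(k) = ((a*k + b) mod p) mod m
    a = random.randint(1, p - 1)
    b = random.randint(0, p - 1)
    return lambda k: ((a * k + b) % p) % m

h = make_universal_hash(m=16)
# Over the random choice of h, any fixed pair of distinct keys
# collides with probability <= 1/m
print(h(42), h(1042))
```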

05

Collision Resolution — Chaining

Each table slot holds a pointer to a linked list. All keys that hash to the same index are stored in that list.

  0: 56 → 245 → ∅
  1: 183 → ∅
  2: 72 → ∅
  3: 129 → ∅
  4: ∅
  5: ∅
  6: ∅

Performance

Operation   Average     Worst
Search      O(1 + α)    O(n)
Insert      O(1)        O(1)
Delete      O(1 + α)    O(n)

α = n/m is the load factor.

Trade-offs

  • Simple to implement; deletion is straightforward
  • Load factor can exceed 1 — table never "fills up"
  • Cache-unfriendly due to pointer chasing
  • Extra memory for pointers (8 bytes each on 64-bit)
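A minimal chaining table sketch (Python lists stand in for the linked chains; `hash` is Python's built-in):

```python
class ChainedHashTable:
    """Separate chaining: each slot holds a list acting as the chain."""
    def __init__(self, m: int = 8):
        self.m = m
        self.slots = [[] for _ in range(m)]

    def insert(self, key, value):
        chain = self.slots[hash(key) % self.m]
        for i, (k, _) in enumerate(chain):
            if k == key:              # key already present: update in place
                chain[i] = (key, value)
                return
        chain.append((key, value))    # otherwise append to the chain

    def get(self, key):
        for k, v in self.slots[hash(key) % self.m]:
            if k == key:
                return v
        raise KeyError(key)

t = ChainedHashTable()
t.insert("alice", 1)
t.insert("bob", 2)
print(t.get("alice"))
```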
06

Open Addressing — Linear Probing

Concept

All elements stored directly in the table array. On collision, probe the next slot in sequence until an empty slot is found.

h(k, i) = (h'(k) + i) mod m

Primary clustering

Contiguous blocks of occupied slots grow and merge, increasing average probe lengths. A cluster of size k has probability (k+1)/m of growing on the next insert.

Why still popular? Linear probing is extremely cache-friendly — sequential memory access dominates performance. With a good hash function and α < 0.7, it outperforms chaining in practice.

Example

Insert 56, 72, 129, 183, 245
m = 7,  h(k) = k mod 7

h(56)  = 0 → slot 0 ✓
h(72)  = 2 → slot 2 ✓
h(129) = 3 → slot 3 ✓
h(183) = 1 → slot 1 ✓
h(245) = 0 → slot 0 full
             slot 1 full
             slot 2 full
             slot 3 full
             slot 4 ✓

Key 245 required 4 probes due to the cluster at slots 0–3.
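The insertion trace above as a sketch — note the probe count here includes the final empty slot, so 245 examines 5 slots (the 4 occupied ones plus slot 4):

```python
def linear_probe_insert(table, key, m=7):
    """Insert key into an open-addressed table (None = empty slot) using
    linear probing; returns the number of slots examined. Assumes the
    table is not full."""
    idx = key % m
    probes = 1
    while table[idx] is not None:
        idx = (idx + 1) % m
        probes += 1
    table[idx] = key
    return probes

table = [None] * 7
for k in [56, 72, 129, 183, 245]:
    print(k, "slots examined:", linear_probe_insert(table, k))
```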

07

Open Addressing — Quadratic Probing

Probe sequence

h(k, i) = (h'(k) + c₁·i + c₂·i²) mod m

Common choice: c₁ = 0, c₂ = 1, giving h(k, i) = (h'(k) + i²) mod m.

Advantages over linear probing

  • Eliminates primary clustering — probes jump over occupied runs
  • Still has reasonable cache performance for small probe distances

Caveat

Quadratic probing does not guarantee visiting all slots. If m is prime, the first ⌈m/2⌉ probe positions are distinct, so insertion always succeeds while α < ½; alternatively, m a power of 2 with c₁ = c₂ = 1/2 visits every slot.

Secondary clustering

Keys that hash to the same initial slot follow the same probe sequence. This is secondary clustering — less severe than primary clustering but still suboptimal.

[Diagram] Probe offsets from the initial slot: 0, +1, +4, +9, +16, ... — later probes jump over occupied runs.

In practice: secondary clustering is rarely a bottleneck. Quadratic probing is a solid middle ground between linear probing's cache friendliness and double hashing's uniform distribution.
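A sketch of the probe sequence for m = 7 — with prime m, the first ⌈m/2⌉ positions are distinct:

```python
def quadratic_probes(h0: int, m: int):
    # Probe sequence h(k, i) = (h0 + i*i) mod m for i = 0..m-1
    return [(h0 + i * i) % m for i in range(m)]

seq = quadratic_probes(0, 7)
print(seq)  # offsets 0, +1, +4, +9, ... reduced mod 7
```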

08

Open Addressing — Double Hashing

h(k, i) = (h₁(k) + i × h₂(k)) mod m

Two independent hash functions. Each key gets its own step size, so different keys that collide on h₁ almost certainly diverge on subsequent probes.

Requirements for h₂

  • h₂(k) ≠ 0 for all k — otherwise the probe sequence never advances
  • h₂(k) must be coprime to m — ensures the full table is probed
  • Common: h₂(k) = 1 + (k mod (m - 1)) with m prime

Performance comparison

Method              Expected probes (miss)       Clustering
Linear probing      ½ (1 + 1/(1−α)²)             Primary
Quadratic probing   ≈ 1/(1−α) − α − ln(1−α)      Secondary
Double hashing      1/(1−α)                      None

Trade-off: Double hashing gives the closest approximation to uniform probing (the theoretical ideal). The cost is two hash computations per probe instead of one.
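A sketch using the common h₂ above, showing how 56 and 245 — which collide at slot 0 under k mod 7 — diverge immediately:

```python
def double_hash_probes(k: int, m: int = 7):
    """Probe sequence h(k, i) = (h1(k) + i*h2(k)) mod m, with the common
    choice h1(k) = k mod m, h2(k) = 1 + (k mod (m - 1)), m prime."""
    h1 = k % m
    h2 = 1 + (k % (m - 1))
    return [(h1 + i * h2) % m for i in range(m)]

# Key-dependent step sizes: the two colliding keys get different sequences
print(double_hash_probes(56))
print(double_hash_probes(245))
```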

09

Load Factor & Rehashing

Load factor α = n / m

α range      Behaviour
0.0 – 0.5    Fast lookups, wasted memory
0.5 – 0.7    Good balance for open addressing
0.7 – 0.9    Probe lengths grow; chaining still OK
> 1.0        Only chaining; degrades linearly

Amortised cost: Each insert is O(1) amortised. A rehash copies all n elements (O(n)), but happens only after Θ(n) insertions.

Rehashing

When α exceeds a threshold, allocate a new table (typically 2m or next prime > 2m), then re-insert every element.

Rehash triggered: α > 0.75
1. Allocate new table of size 2m
2. For each entry in old table:
     compute new_index = h(key) mod 2m
     insert into new table
3. Free old table

Most implementations also shrink the table when α drops below ~0.25 to reclaim memory.
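The rehash steps above as a sketch for an open-addressed table with linear probing (None marks an empty slot):

```python
def rehash(old_table, old_m):
    """Grow the table to size 2m and re-insert every key with linear
    probing under the new modulus."""
    new_m = 2 * old_m
    new_table = [None] * new_m
    for key in old_table:
        if key is None:
            continue
        idx = key % new_m             # index must be recomputed: k % 2m != k % m
        while new_table[idx] is not None:
            idx = (idx + 1) % new_m
        new_table[idx] = key
    return new_table

table = [56, 183, 72, 129, 245, None, None]   # the m = 7 table from earlier
print(rehash(table, 7))
```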

10

Perfect Hashing

A hash function that maps n keys into a table with zero collisions. Works when the key set is known in advance (static dictionary).

FKS scheme (Fredman, Komlós, Szemerédi, 1984)

Two-level hashing with O(n) space and O(1) worst-case lookup.

Level 1: h₁ maps n keys into m = n slots
         Some slots have collisions (sⱼ keys)

Level 2: For each bucket j with sⱼ keys,
         use secondary table of size sⱼ²
         Choose h₂ⱼ from universal family
         → guaranteed collision-free

Space analysis

Total space across all secondary tables is O(n) in expectation when m = n at level 1. The sⱼ² sizes sound wasteful, but the sum of squares stays linear because the level-1 hash distributes keys well.

Use cases

  • Compilers — keyword recognition
  • Network routers — forwarding lookup tables
  • Read-only dictionaries — build time amortised over millions of lookups
11

Minimal Perfect Hashing

A perfect hash function where the table size equals exactly n — every slot is occupied, zero waste, zero collisions. h: S → {0, 1, ..., n-1} is a bijection.

Construction algorithms

Algorithm   Build   Bits/key   Lookup
CHD         O(n)    ~2.07      O(1)
BBHash      O(n)    ~3.0       O(1)
RecSplit    O(n)    ~1.56      O(1)
PTHash      O(n)    ~2.2       O(1)

Theoretical lower bound: ~1.44 bits per key. Modern algorithms approach this limit.

Practical use cases

Static key-value stores — database index for immutable data

Language model vocabularies — map tokens to embedding indices

Genome indexing — map k-mers to positions in a reference

Content-addressable storage — deduplicate by content hash

12

Consistent Hashing

The problem

Standard hash(key) mod n breaks when n (number of servers) changes — nearly all keys remap, causing a rehashing storm.

The ring

Map both keys and servers onto a circular hash space [0, 2³²). Each key is assigned to the next server clockwise on the ring.

Adding or removing a node

When server S₄ joins, only the keys between S₃ and S₄ need to move. All other mappings remain unchanged.

  • Keys remapped: O(K/n) on average
  • hash mod n: remaps O(K) keys
[Diagram] Servers S₁–S₃ and keys k₁–k₄ placed on the hash space [0, 2³²); each key belongs to the next server clockwise.

Karger et al. (1997) — backbone of Memcached, DynamoDB, Cassandra, Akamai CDN.
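A minimal ring sketch (the `HashRing` name is illustrative; SHA-256 truncated to 32 bits stands in for the ring hash):

```python
import bisect
import hashlib

class HashRing:
    """Consistent-hash ring: servers and keys share one hash space."""
    def __init__(self, nodes):
        self.ring = sorted((self._h(n), n) for n in nodes)

    @staticmethod
    def _h(s: str) -> int:
        # Map names into the 32-bit hash space [0, 2^32)
        return int.from_bytes(hashlib.sha256(s.encode()).digest()[:4], "big")

    def lookup(self, key: str) -> str:
        # First node clockwise from the key's position, wrapping around
        i = bisect.bisect(self.ring, (self._h(key), ""))
        return self.ring[i % len(self.ring)][1]

ring = HashRing(["S1", "S2", "S3"])
print(ring.lookup("alice"))
```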

13

Consistent Hashing — Virtual Nodes

The balance problem

With few physical nodes, ring partitions are uneven. One server may own 60% of the key space while another owns 5%.

Virtual nodes (vnodes)

Each physical server is represented by v virtual nodes spread across the ring. More vnodes = more uniform distribution.

Physical S₁ → S₁_0, S₁_1, S₁_2, ...
Physical S₂ → S₂_0, S₂_1, S₂_2, ...

Benefits

  • Load balance — with v = 150, std dev of load drops below 5%
  • Heterogeneous capacity — assign more vnodes to more powerful servers
  • Smoother rebalancing — departing node's vnodes scatter load across many survivors
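The load-balance claim can be checked by simulation — a sketch (helper names are illustrative) counting keys owned per physical node as the vnode count grows:

```python
import bisect
import hashlib

def owner_counts(nodes, vnodes, n_keys=10_000):
    """Place vnodes per physical node on a ring, then count how many of
    n_keys each physical node ends up owning."""
    h = lambda s: int.from_bytes(hashlib.sha256(s.encode()).digest()[:4], "big")
    ring = sorted((h(f"{n}_{i}"), n) for n in nodes for i in range(vnodes))
    points = [p for p, _ in ring]
    counts = dict.fromkeys(nodes, 0)
    for k in range(n_keys):
        i = bisect.bisect(points, h(f"key{k}")) % len(ring)
        counts[ring[i][1]] += 1
    return counts

# More vnodes -> counts cluster tightly around n_keys / len(nodes)
print(owner_counts(["S1", "S2", "S3"], vnodes=1))
print(owner_counts(["S1", "S2", "S3"], vnodes=150))
```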

Implementations

System         Ring mechanism
DynamoDB       Consistent hashing with vnodes & preference lists
Cassandra      Token ring, default 256 vnodes per node
Riak           Ring of 64/128/256 partitions with ownership transfer
Akka Cluster   Consistent hash router with vnodes
14

Cryptographic Hash Functions

Properties beyond standard hashing

Property               Description
Pre-image resistance   Given h, infeasible to find m where H(m) = h
Second pre-image       Given m₁, infeasible to find m₂ ≠ m₁ with the same hash
Collision resistance   Infeasible to find any m₁ ≠ m₂ with the same hash
Avalanche effect       Flipping one input bit changes ~50% of output bits
Fixed output size      Arbitrary-length input → fixed-length digest

Common algorithms

Algorithm   Output         Status
MD5         128 bits       Broken
SHA-1       160 bits       Deprecated
SHA-256     256 bits       Current standard
SHA-3       224–512 bits   NIST standard (Keccak)
BLAKE3      256 bits       Fastest modern hash
15

SHA-256 & BLAKE3

SHA-256

Merkle–Damgård construction. Processes input in 512-bit blocks through 64 rounds of mixing.

  • Output: 256 bits (32 bytes, 64 hex chars)
  • Used in: TLS, Bitcoin, Git, JWT signatures
  • Speed: ~500 MB/s with SHA-NI extensions
SHA-256("hello") =
2cf24dba5fb0a30e26e83b2ac5b9e29e
1b161e5c1fa7425e7304362938b62494

BLAKE3

Tree-based Merkle construction. Unlimited parallelism across chunks.

  • Output: 256 bits (extendable)
  • Speed: ~5 GB/s single-threaded; scales with cores
  • Used in: Bao, content-addressable storage, build caches

When to use which

Use case               Choice
Standards compliance   SHA-256
Max throughput         BLAKE3
Password storage       Neither — use Argon2/bcrypt
16

Hash Tables in Practice

Python dict (CPython 3.6+)

  • Open addressing with pseudo-random probing
  • Compact layout: dense insertion-order array; hash table holds indices
  • Resize at α > 2/3; power-of-2 sizes
  • Guaranteed insertion-order iteration (3.7+)

Java HashMap

  • Chaining with linked lists; a bucket converts to a red-black tree once it exceeds 8 entries
  • Default load factor: 0.75; power-of-2 capacity
  • Key must implement hashCode() and equals()

C++ std::unordered_map

  • Chaining (bucket array of linked lists) per the standard
  • Load factor threshold: 1.0 by default
  • Cache-unfriendly; often replaced with Abseil flat_hash_map

Rust HashMap (SwissTable)

  • Open addressing with SIMD metadata probing (hashbrown crate)
  • 1-byte control byte per slot: empty, deleted, or 7 bits of H2
  • Resize at α > 7/8; extremely cache-efficient
17

Cuckoo Hashing

Concept

Two hash functions h₁, h₂ and two tables T₁, T₂. Each key lives in exactly one of its two possible positions.

Insert algorithm

def insert(key, max_loops=32):
  for _ in range(max_loops):
    if T1[h1(key)] is None:
      T1[h1(key)] = key; return
    # Evict occupant of T1[h1(key)], take its slot
    key, T1[h1(key)] = T1[h1(key)], key
    if T2[h2(key)] is None:
      T2[h2(key)] = key; return
    # Evict from T2 too; loop with the displaced key
    key, T2[h2(key)] = T2[h2(key)], key
  rehash()  # cycle detected — rebuild with new h1, h2

Properties

Property   Value
Lookup     O(1) worst case — check 2 locations
Insert     O(1) amortised
Space      ~50% utilisation (α ≈ 0.5)
Delete     O(1) — no tombstones

Worst-case O(1) lookup — no long probe chains, no degenerate linked lists. Used in network switches, GPU hash tables, and high-frequency trading.

18

Robin Hood Hashing

Concept

Open addressing with linear probing, but elements are reordered during insertion to equalise probe distances. "Rob from the rich, give to the poor."

Insert rule

If the new key's probe distance exceeds the current occupant's, swap them and continue inserting the displaced element.

def insert(key):
  dist, idx = 0, h(key)            # probe_dist(k, i) = (i - h(k)) mod m
  while table[idx] is not None:
    occ = probe_dist(table[idx], idx)
    if dist > occ:
      # occupant is "richer" — swap, carry the displaced key onward
      key, table[idx] = table[idx], key
      dist = occ
    idx = (idx + 1) % m
    dist += 1
  table[idx] = key

Benefits

Variance reduction — max probe distance is O(log n) with high probability

Cache-friendly — linear probing with bounded displacement

Early termination — stop search when current element's probe distance exceeds search distance

In practice: Robin Hood hashing was the basis of Rust's original HashMap (pre-SwissTable) and many game engine hash maps. Its bounded probe distance makes worst-case performance predictable.

19

Locality-Sensitive Hashing

Concept

Standard hashes make similar inputs diverge. LSH does the opposite: similar items hash to the same bucket with high probability.

Common LSH families

Distance    LSH family      Use
Jaccard     MinHash         Document dedup
Cosine      SimHash         Semantic similarity
Euclidean   Random proj.    Image search
Hamming     Bit sampling    Fingerprint match

Applications

  • Approximate nearest-neighbour search in high dimensions
  • Near-duplicate web page detection (Google, Bing)
  • Audio fingerprinting (Shazam uses a related technique)
  • Genome sequence alignment (fast seed generation)

Trade-off: LSH trades exact correctness for massive speed gains. Brute-force nearest-neighbour search over n points in d dimensions costs O(n·d) per query; LSH reduces this to sub-linear time with tuneable accuracy.

Formal definition

A family H is (d₁, d₂, p₁, p₂)-sensitive if close points (dist ≤ d₁) collide with probability ≥ p₁ and far points (dist ≥ d₂) collide with probability ≤ p₂.
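A MinHash sketch for the Jaccard row above (Python's built-in `hash` stands in for a proper string hash; `num_hashes` trades accuracy for signature size):

```python
import random

def minhash_signature(tokens, num_hashes=64, seed=0):
    """signature[i] = min over tokens of the i-th hash. Two sets agree on
    a signature entry with probability ~ their Jaccard similarity."""
    rng = random.Random(seed)
    p = (1 << 31) - 1                      # Mersenne prime
    params = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(num_hashes)]
    return [min((a * hash(t) + b) % p for t in tokens) for a, b in params]

def estimate_jaccard(sig1, sig2):
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)

a = minhash_signature({"the", "quick", "brown", "fox"})
b = minhash_signature({"the", "quick", "brown", "dog"})
print(estimate_jaccard(a, b))   # roughly the true Jaccard, 3/5 = 0.6
```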

20

Applications

Caches

Hash tables power in-memory caches (Memcached, Redis). Key lookup in O(1) enables microsecond response times. Consistent hashing distributes keys across nodes.

Deduplication

Content-addressable storage hashes each data block. If the hash exists, the block is a duplicate — store a reference instead. Used in Borg, Restic, ZFS, and build caches.

Checksums & integrity

Detect corruption in data transfer and storage. CRC32 for fast error detection; SHA-256 for cryptographic integrity verification.

Bloom filters

Space-efficient probabilistic structure using k hash functions to set bits in a bit array. "Possibly in the set" or "definitely not" — false positives, no false negatives.

  • ~10 bits per element for 1% false positive rate
  • Used in databases, routers, spell checkers, Chrome safe browsing
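A minimal Bloom filter sketch (SHA-256 with an index prefix simulates the k independent hash functions):

```python
import hashlib

class BloomFilter:
    """k hash functions set k bits per element in a fixed bit array."""
    def __init__(self, m_bits=1024, k=7):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8 + 1)

    def _positions(self, item: str):
        # Derive k positions by prefixing the item with the hash index
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False -> definitely absent; True -> possibly present
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

bf = BloomFilter()
bf.add("alice")
print(bf.might_contain("alice"), bf.might_contain("mallory"))
```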

More applications

  • HMAC — authenticate messages with a shared secret
  • DHTs — Chord, Kademlia (BitTorrent, IPFS)
  • Load balancing — hash request key to pick backend server
  • Data partitioning — shard databases by hash of partition key
21

Summary & Further Reading

Key takeaways

  • Hash functions must be deterministic, efficient, and distribute keys uniformly
  • Division method requires careful choice of m; multiplication method is more robust
  • Chaining tolerates high load factors; open addressing is cache-friendly but load-sensitive
  • Double hashing eliminates clustering; cuckoo hashing guarantees worst-case O(1) lookup
  • Robin Hood hashing bounds probe-distance variance for predictable performance
  • Consistent hashing with vnodes is the backbone of distributed storage
  • Cryptographic hashes add pre-image and collision resistance — essential for security
  • Perfect hashing achieves zero collisions for static key sets
  • LSH inverts the hash contract — similar items hash together for ANN search
  • Real-world hash tables combine open addressing with SIMD and careful engineering

Recommended reading

Source          Description
CLRS            Introduction to Algorithms — Ch. 11, the canonical hash tables reference
Knuth           TAOCP Vol. 3 — Sorting and Searching, Section 6.4
Karger et al.   "Consistent Hashing and Random Trees" (1997)
Mitzenmacher    Probability and Computing — hashing and Bloom filter analysis
Swiss Tables    CppCon 2017 — Matt Kulukundis on Abseil flat_hash_map