COMPUTER SCIENCE FUNDAMENTALS SERIES

Hashing
Algorithms

Hash functions · Collision resolution · Open addressing ·
Consistent hashing · Cryptographic hashes · Perfect hashing
Mid-level software engineer track  ·  20 slides
02

Hash Function Properties

What is a hash function?

A hash function h(k) maps keys from a universe U to a fixed-size set of integers {0, 1, ..., m-1}, where m is the table size.

Desirable properties

  • Deterministic — the same key always produces the same hash value
  • Uniform distribution — keys spread evenly across the output range to minimise collisions
  • Efficiency — computing the hash should be O(1) or near-constant time
  • Avalanche effect — a small change in input produces a large change in output
  • Minimised collisions — different keys should map to different indices as often as possible
[Diagram] Keys from the universe U ("alice", "bob", "carol", "dave") pass through h(k) mod m into slots {0..m-1}:

  0: "bob"   1: —   2: "alice"   3: "carol"   4: —   5: "dave"

When |U| > m, collisions are inevitable (pigeonhole principle).
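The avalanche effect can be checked empirically — a minimal sketch (using SHA-256 as the reference hash) that changes one input character and counts differing output bits:

```python
import hashlib

def bit_diff(a: bytes, b: bytes) -> int:
    # Count differing bits between two equal-length digests
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

d1 = hashlib.sha256(b"alice").digest()
d2 = hashlib.sha256(b"alicf").digest()  # single-character change

# A good hash flips roughly half of the 256 output bits
print(bit_diff(d1, d2))
```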
03

The Division Method

Formula

h(k) = k mod m

The simplest hash function. The key k is divided by the table size m, and the remainder is the hash value.

Choosing m

  • Avoid powers of 2 — k mod 2^p uses only the lowest p bits, ignoring high-order information
  • Avoid powers of 10 — same issue for decimal-encoded keys
  • Prefer primes — a prime m not close to a power of 2 gives good distribution
  • Example: for ~2000 slots, m = 2003 (prime) is better than m = 2048

Example

Keys: 56, 72, 129, 183, 245
m = 7

h(56)  = 56  mod 7 = 0
h(72)  = 72  mod 7 = 2
h(129) = 129 mod 7 = 3
h(183) = 183 mod 7 = 1
h(245) = 245 mod 7 = 0  ← collision with 56

Collision: keys 56 and 245 both map to slot 0. Every hash function must have a strategy for handling this.
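The worked example above as a runnable sketch:

```python
def h(k: int, m: int = 7) -> int:
    # Division method: hash is the remainder of k divided by table size m
    return k % m

keys = [56, 72, 129, 183, 245]
for k in keys:
    print(k, "->", h(k))  # 56 and 245 both land in slot 0
```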

04

Multiplication Method & Universal Hashing

Multiplication method

h(k) = ⌊m × (k × A mod 1)⌋

Where A is a constant in (0, 1). Knuth recommends A ≈ (√5 − 1) / 2 ≈ 0.6180339887.

  • Works well with any table size m — no need for primes
  • Extracts information from all bits of the key
  • Easy to implement with fixed-point arithmetic

Universal hashing

Choose h randomly from a family of hash functions at runtime. A family H is universal if for any two distinct keys x ≠ y:

Pr[h(x) = h(y)] ≤ 1/m

Carter-Wegman family

h_{a,b}(k) = ((a × k + b) mod p) mod m

Where p is a prime larger than the universe, a ∈ {1, ..., p-1}, b ∈ {0, ..., p-1}.

Key insight: Universal hashing guarantees O(1) expected-time lookups regardless of the input distribution — it defeats adversarial inputs.
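A sketch of drawing a random member of the Carter-Wegman family (the Mersenne prime 2³¹ − 1 stands in for p here, assuming keys fit below it):

```python
import random

def make_universal_hash(m: int, p: int = 2_147_483_647):
    # Random a, b fix one member of the family h_{a,b}(k) = ((a*k + b) mod p) mod m
    a = random.randint(1, p - 1)
    b = random.randint(0, p - 1)
    return lambda k: ((a * k + b) % p) % m

h = make_universal_hash(m=16)
# Over the random choice of h, any fixed pair of distinct keys
# collides with probability <= 1/m
print(h(42), h(1042))
```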

05

Collision Resolution — Chaining

Each table slot holds a pointer to a linked list. All keys that hash to the same index are stored in that list.

  0: 56 → 245 → ∅
  1: 183 → ∅
  2: 72 → ∅
  3: 129 → ∅
  4: ∅
  5: ∅
  6: ∅

Performance

Operation   Average     Worst
Search      O(1 + α)    O(n)
Insert      O(1)        O(1)
Delete      O(1 + α)    O(n)

α = n/m is the load factor.

Trade-offs

  • Simple to implement; deletion is straightforward
  • Load factor can exceed 1 — table never "fills up"
  • Cache-unfriendly due to pointer chasing
  • Extra memory for pointers (8 bytes each on 64-bit)
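A minimal chaining table sketch (Python lists stand in for the linked chains; `hash` is Python's built-in):

```python
class ChainedHashTable:
    """Separate chaining: each slot holds a list acting as the chain."""
    def __init__(self, m: int = 8):
        self.m = m
        self.slots = [[] for _ in range(m)]

    def insert(self, key, value):
        chain = self.slots[hash(key) % self.m]
        for i, (k, _) in enumerate(chain):
            if k == key:              # key already present: update in place
                chain[i] = (key, value)
                return
        chain.append((key, value))    # otherwise append to the chain

    def get(self, key):
        for k, v in self.slots[hash(key) % self.m]:
            if k == key:
                return v
        raise KeyError(key)

t = ChainedHashTable()
t.insert("alice", 1)
t.insert("bob", 2)
print(t.get("alice"))
```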
06

Open Addressing — Linear Probing

Concept

All elements stored directly in the table array. On collision, probe the next slot in sequence until an empty slot is found.

h(k, i) = (h'(k) + i) mod m

Primary clustering

Contiguous blocks of occupied slots grow and merge, increasing average probe lengths. A cluster of size k has probability (k+1)/m of growing on the next insert.

Why still popular? Linear probing is extremely cache-friendly — sequential memory access dominates performance. With a good hash function and α < 0.7, it outperforms chaining in practice.

Example

Insert 56, 72, 129, 183, 245
m = 7,  h(k) = k mod 7

h(56)  = 0 → slot 0 ✓
h(72)  = 2 → slot 2 ✓
h(129) = 3 → slot 3 ✓
h(183) = 1 → slot 1 ✓
h(245) = 0 → slot 0 full
             slot 1 full
             slot 2 full
             slot 3 full
             slot 4 ✓

Key 245 required 4 probes due to the cluster at slots 0–3.
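The insertion trace above as a sketch — note the probe count here includes the final empty slot, so 245 examines 5 slots (the 4 occupied ones plus slot 4):

```python
def linear_probe_insert(table, key, m=7):
    """Insert key into an open-addressed table (None = empty slot) using
    linear probing; returns the number of slots examined. Assumes the
    table is not full."""
    idx = key % m
    probes = 1
    while table[idx] is not None:
        idx = (idx + 1) % m
        probes += 1
    table[idx] = key
    return probes

table = [None] * 7
for k in [56, 72, 129, 183, 245]:
    print(k, "slots examined:", linear_probe_insert(table, k))
```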

07

Open Addressing — Quadratic Probing

Probe sequence

h(k, i) = (h'(k) + c₁·i + c₂·i²) mod m

Common choice: c₁ = 0, c₂ = 1, giving h(k, i) = (h'(k) + i²) mod m.

Advantages over linear probing

  • Eliminates primary clustering — probes jump over occupied runs
  • Still has reasonable cache performance for small probe distances

Caveat

Quadratic probing does not guarantee visiting all slots. If m is prime, the first ⌈m/2⌉ probe positions are distinct, so insertion always succeeds while α < ½; alternatively, m a power of 2 with c₁ = c₂ = 1/2 visits every slot.

Secondary clustering

Keys that hash to the same initial slot follow the same probe sequence. This is secondary clustering — less severe than primary clustering but still suboptimal.

[Diagram] Probe offsets from the initial slot: 0, +1, +4, +9, +16, ... — later probes jump over occupied runs.

In practice: secondary clustering is rarely a bottleneck. Quadratic probing is a solid middle ground between linear probing's cache friendliness and double hashing's uniform distribution.
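A sketch of the probe sequence for m = 7 — with prime m, the first ⌈m/2⌉ positions are distinct:

```python
def quadratic_probes(h0: int, m: int):
    # Probe sequence h(k, i) = (h0 + i*i) mod m for i = 0..m-1
    return [(h0 + i * i) % m for i in range(m)]

seq = quadratic_probes(0, 7)
print(seq)  # offsets 0, +1, +4, +9, ... reduced mod 7
```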

08

Open Addressing — Double Hashing

h(k, i) = (h₁(k) + i × h₂(k)) mod m

Two independent hash functions. Each key gets its own step size, so different keys that collide on h₁ almost certainly diverge on subsequent probes.

Requirements for h₂

  • h₂(k) ≠ 0 for all k — otherwise the probe sequence never advances
  • h₂(k) must be coprime to m — ensures the full table is probed
  • Common: h₂(k) = 1 + (k mod (m - 1)) with m prime

Performance comparison

Method              Expected probes (miss)       Clustering
Linear probing      ½ (1 + 1/(1−α)²)             Primary
Quadratic probing   ≈ 1/(1−α) − α − ln(1−α)      Secondary
Double hashing      1/(1−α)                      None

Trade-off: Double hashing gives the closest approximation to uniform probing (the theoretical ideal). The cost is two hash computations per probe instead of one.
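A sketch using the common h₂ above, showing how 56 and 245 — which collide at slot 0 under k mod 7 — diverge immediately:

```python
def double_hash_probes(k: int, m: int = 7):
    """Probe sequence h(k, i) = (h1(k) + i*h2(k)) mod m, with the common
    choice h1(k) = k mod m, h2(k) = 1 + (k mod (m - 1)), m prime."""
    h1 = k % m
    h2 = 1 + (k % (m - 1))
    return [(h1 + i * h2) % m for i in range(m)]

# Key-dependent step sizes: the two colliding keys get different sequences
print(double_hash_probes(56))
print(double_hash_probes(245))
```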

09

Load Factor & Rehashing

Load factor α = n / m

α range      Behaviour
0.0 – 0.5    Fast lookups, wasted memory
0.5 – 0.7    Good balance for open addressing
0.7 – 0.9    Probe lengths grow; chaining still OK
> 1.0        Only chaining; degrades linearly

Amortised cost: Each insert is O(1) amortised. A rehash copies all n elements (O(n)), but happens only after Θ(n) insertions.

Rehashing

When α exceeds a threshold, allocate a new table (typically 2m or next prime > 2m), then re-insert every element.

Rehash triggered: α > 0.75
1. Allocate new table of size 2m
2. For each entry in old table:
     compute new_index = h(key) mod 2m
     insert into new table
3. Free old table

Most implementations also shrink the table when α drops below ~0.25 to reclaim memory.
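The rehash steps above as a sketch for an open-addressed table with linear probing (None marks an empty slot):

```python
def rehash(old_table, old_m):
    """Grow the table to size 2m and re-insert every key with linear
    probing under the new modulus."""
    new_m = 2 * old_m
    new_table = [None] * new_m
    for key in old_table:
        if key is None:
            continue
        idx = key % new_m             # index must be recomputed: k % 2m != k % m
        while new_table[idx] is not None:
            idx = (idx + 1) % new_m
        new_table[idx] = key
    return new_table

table = [56, 183, 72, 129, 245, None, None]   # the m = 7 table from earlier
print(rehash(table, 7))
```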

10

Perfect Hashing

A hash function that maps n keys into a table with zero collisions. Works when the key set is known in advance (static dictionary).

FKS scheme (Fredman, Komlós, Szemerédi, 1984)

Two-level hashing with O(n) space and O(1) worst-case lookup.

Level 1: h₁ maps n keys into m = n slots
         Some slots have collisions (sⱼ keys)

Level 2: For each bucket j with sⱼ keys,
         use secondary table of size sⱼ²
         Choose h₂ⱼ from universal family
         → guaranteed collision-free

Space analysis

Total space across all secondary tables is O(n) in expectation when m = n at level 1. The sⱼ² sizes sound wasteful, but the sum of squares stays linear because the level-1 hash distributes keys well.

Use cases

  • Compilers — keyword recognition
  • Network routers — forwarding lookup tables
  • Read-only dictionaries — build time amortised over millions of lookups
11

Minimal Perfect Hashing

A perfect hash function where the table size equals exactly n — every slot is occupied, zero waste, zero collisions. h: S → {0, 1, ..., n-1} is a bijection.

Construction algorithms

Algorithm   Build   Bits/key   Lookup
CHD         O(n)    ~2.07      O(1)
BBHash      O(n)    ~3.0       O(1)
RecSplit    O(n)    ~1.56      O(1)
PTHash      O(n)    ~2.2       O(1)

Theoretical lower bound: ~1.44 bits per key. Modern algorithms approach this limit.

Practical use cases

Static key-value stores — database index for immutable data

Language model vocabularies — map tokens to embedding indices

Genome indexing — map k-mers to positions in a reference

Content-addressable storage — deduplicate by content hash

12

Consistent Hashing

The problem

Standard hash(key) mod n breaks when n (number of servers) changes — nearly all keys remap, causing a rehashing storm.

The ring

Map both keys and servers onto a circular hash space [0, 2³²). Each key is assigned to the next server clockwise on the ring.

Adding or removing a node

When server S₄ joins, only the keys between S₃ and S₄ need to move. All other mappings remain unchanged.

  • Keys remapped: O(K/n) on average
  • hash mod n: remaps O(K) keys
[Diagram] Servers S₁–S₃ and keys k₁–k₄ placed on the hash space [0, 2³²); each key belongs to the next server clockwise.

Karger et al. (1997) — backbone of Memcached, DynamoDB, Cassandra, Akamai CDN.
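A minimal ring sketch (the `HashRing` name is illustrative; SHA-256 truncated to 32 bits stands in for the ring hash):

```python
import bisect
import hashlib

class HashRing:
    """Consistent-hash ring: servers and keys share one hash space."""
    def __init__(self, nodes):
        self.ring = sorted((self._h(n), n) for n in nodes)

    @staticmethod
    def _h(s: str) -> int:
        # Map names into the 32-bit hash space [0, 2^32)
        return int.from_bytes(hashlib.sha256(s.encode()).digest()[:4], "big")

    def lookup(self, key: str) -> str:
        # First node clockwise from the key's position, wrapping around
        i = bisect.bisect(self.ring, (self._h(key), ""))
        return self.ring[i % len(self.ring)][1]

ring = HashRing(["S1", "S2", "S3"])
print(ring.lookup("alice"))
```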

13

Consistent Hashing — Virtual Nodes

The balance problem

With few physical nodes, ring partitions are uneven. One server may own 60% of the key space while another owns 5%.

Virtual nodes (vnodes)

Each physical server is represented by v virtual nodes spread across the ring. More vnodes = more uniform distribution.

Physical S₁ → S₁_0, S₁_1, S₁_2, ...
Physical S₂ → S₂_0, S₂_1, S₂_2, ...

Benefits

  • Load balance — with v = 150, std dev of load drops below 5%
  • Heterogeneous capacity — assign more vnodes to more powerful servers
  • Smoother rebalancing — departing node's vnodes scatter load across many survivors
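The load-balance claim can be checked by simulation — a sketch (helper names are illustrative) counting keys owned per physical node as the vnode count grows:

```python
import bisect
import hashlib

def owner_counts(nodes, vnodes, n_keys=10_000):
    """Place vnodes per physical node on a ring, then count how many of
    n_keys each physical node ends up owning."""
    h = lambda s: int.from_bytes(hashlib.sha256(s.encode()).digest()[:4], "big")
    ring = sorted((h(f"{n}_{i}"), n) for n in nodes for i in range(vnodes))
    points = [p for p, _ in ring]
    counts = dict.fromkeys(nodes, 0)
    for k in range(n_keys):
        i = bisect.bisect(points, h(f"key{k}")) % len(ring)
        counts[ring[i][1]] += 1
    return counts

# More vnodes -> counts cluster tightly around n_keys / len(nodes)
print(owner_counts(["S1", "S2", "S3"], vnodes=1))
print(owner_counts(["S1", "S2", "S3"], vnodes=150))
```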

Implementations

System         Ring mechanism
DynamoDB       Consistent hashing with vnodes & preference lists
Cassandra      Token ring, default 256 vnodes per node
Riak           Ring of 64/128/256 partitions with ownership transfer
Akka Cluster   Consistent hash router with vnodes
14

Cryptographic Hash Functions

Properties beyond standard hashing

Property               Description
Pre-image resistance   Given h, infeasible to find m where H(m) = h
Second pre-image       Given m₁, infeasible to find m₂ ≠ m₁ with the same hash
Collision resistance   Infeasible to find any m₁ ≠ m₂ with the same hash
Avalanche effect       Flipping one input bit changes ~50% of output bits
Fixed output size      Arbitrary-length input → fixed-length digest

Common algorithms

Algorithm   Output         Status
MD5         128 bits       Broken
SHA-1       160 bits       Deprecated
SHA-256     256 bits       Current standard
SHA-3       224–512 bits   NIST standard (Keccak)
BLAKE3      256 bits       Fastest modern hash
15

SHA-256 & BLAKE3

SHA-256

Merkle–Damgård construction. Processes input in 512-bit blocks through 64 rounds of mixing.

  • Output: 256 bits (32 bytes, 64 hex chars)
  • Used in: TLS, Bitcoin, Git, JWT signatures
  • Speed: ~500 MB/s with SHA-NI extensions
SHA-256("hello") =
2cf24dba5fb0a30e26e83b2ac5b9e29e
1b161e5c1fa7425e7304362938b62494

BLAKE3

Tree-based Merkle construction. Unlimited parallelism across chunks.

  • Output: 256 bits (extendable)
  • Speed: ~5 GB/s single-threaded; scales with cores
  • Used in: Bao, content-addressable storage, build caches

When to use which

Use case               Choice
Standards compliance   SHA-256
Max throughput         BLAKE3
Password storage       Neither — use Argon2/bcrypt
16

Hash Tables in Practice

Python dict (CPython 3.6+)

  • Open addressing with pseudo-random probing
  • Compact layout: dense insertion-order array; hash table holds indices
  • Resize at α > 2/3; power-of-2 sizes
  • Guaranteed insertion-order iteration (3.7+)

Java HashMap

  • Chaining with linked lists; a bucket converts to a red-black tree once it exceeds 8 entries
  • Default load factor: 0.75; power-of-2 capacity
  • Key must implement hashCode() and equals()

C++ std::unordered_map

  • Chaining (bucket array of linked lists) per the standard
  • Load factor threshold: 1.0 by default
  • Cache-unfriendly; often replaced with Abseil flat_hash_map

Rust HashMap (SwissTable)

  • Open addressing with SIMD metadata probing (hashbrown crate)
  • 1-byte control byte per slot: empty, deleted, or 7 bits of H2
  • Resize at α > 7/8; extremely cache-efficient
17

Cuckoo Hashing

Concept

Two hash functions h₁, h₂ and two tables T₁, T₂. Each key lives in exactly one of its two possible positions.

Insert algorithm

def insert(key, max_loops=32):
  for _ in range(max_loops):
    if T1[h1(key)] is None:
      T1[h1(key)] = key; return
    # Evict occupant of T1[h1(key)], take its slot
    key, T1[h1(key)] = T1[h1(key)], key
    if T2[h2(key)] is None:
      T2[h2(key)] = key; return
    # Evict from T2 too; loop with the displaced key
    key, T2[h2(key)] = T2[h2(key)], key
  rehash()  # cycle detected — rebuild with new h1, h2

Properties

Property   Value
Lookup     O(1) worst case — check 2 locations
Insert     O(1) amortised
Space      ~50% utilisation (α ≈ 0.5)
Delete     O(1) — no tombstones

Worst-case O(1) lookup — no long probe chains, no degenerate linked lists. Used in network switches, GPU hash tables, and high-frequency trading.

18

Robin Hood Hashing

Concept

Open addressing with linear probing, but elements are reordered during insertion to equalise probe distances. "Rob from the rich, give to the poor."

Insert rule

If the new key's probe distance exceeds the current occupant's, swap them and continue inserting the displaced element.

def insert(key):
  dist, idx = 0, h(key)            # probe_dist(k, i) = (i - h(k)) mod m
  while table[idx] is not None:
    occ = probe_dist(table[idx], idx)
    if dist > occ:
      # occupant is "richer" — swap, carry the displaced key onward
      key, table[idx] = table[idx], key
      dist = occ
    idx = (idx + 1) % m
    dist += 1
  table[idx] = key

Benefits

Variance reduction — max probe distance is O(log n) with high probability

Cache-friendly — linear probing with bounded displacement

Early termination — stop search when current element's probe distance exceeds search distance

In practice: Robin Hood hashing was the basis of Rust's original HashMap (pre-SwissTable) and many game engine hash maps. Its bounded probe distance makes worst-case performance predictable.

19

Locality-Sensitive Hashing

Concept

Standard hashes make similar inputs diverge. LSH does the opposite: similar items hash to the same bucket with high probability.

Common LSH families

Distance    LSH family      Use
Jaccard     MinHash         Document dedup
Cosine      SimHash         Semantic similarity
Euclidean   Random proj.    Image search
Hamming     Bit sampling    Fingerprint match

Applications

  • Approximate nearest-neighbour search in high dimensions
  • Near-duplicate web page detection (Google, Bing)
  • Audio fingerprinting (Shazam uses a related technique)
  • Genome sequence alignment (fast seed generation)

Trade-off: LSH trades exact correctness for massive speed gains. Brute-force nearest-neighbour search over n points in d dimensions costs O(n·d) per query; LSH reduces this to sub-linear time with tuneable accuracy.

Formal definition

A family H is (d₁, d₂, p₁, p₂)-sensitive if close points (dist ≤ d₁) collide with probability ≥ p₁ and far points (dist ≥ d₂) collide with probability ≤ p₂.
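A MinHash sketch for the Jaccard row above (Python's built-in `hash` stands in for a proper string hash; `num_hashes` trades accuracy for signature size):

```python
import random

def minhash_signature(tokens, num_hashes=64, seed=0):
    """signature[i] = min over tokens of the i-th hash. Two sets agree on
    a signature entry with probability ~ their Jaccard similarity."""
    rng = random.Random(seed)
    p = (1 << 31) - 1                      # Mersenne prime
    params = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(num_hashes)]
    return [min((a * hash(t) + b) % p for t in tokens) for a, b in params]

def estimate_jaccard(sig1, sig2):
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)

a = minhash_signature({"the", "quick", "brown", "fox"})
b = minhash_signature({"the", "quick", "brown", "dog"})
print(estimate_jaccard(a, b))   # roughly the true Jaccard, 3/5 = 0.6
```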

20

Applications

Caches

Hash tables power in-memory caches (Memcached, Redis). Key lookup in O(1) enables microsecond response times. Consistent hashing distributes keys across nodes.

Deduplication

Content-addressable storage hashes each data block. If the hash exists, the block is a duplicate — store a reference instead. Used in Borg, Restic, ZFS, and build caches.

Checksums & integrity

Detect corruption in data transfer and storage. CRC32 for fast error detection; SHA-256 for cryptographic integrity verification.

Bloom filters

Space-efficient probabilistic structure using k hash functions to set bits in a bit array. "Possibly in the set" or "definitely not" — false positives, no false negatives.

  • ~10 bits per element for 1% false positive rate
  • Used in databases, routers, spell checkers, Chrome safe browsing
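A minimal Bloom filter sketch (SHA-256 with an index prefix simulates the k independent hash functions):

```python
import hashlib

class BloomFilter:
    """k hash functions set k bits per element in a fixed bit array."""
    def __init__(self, m_bits=1024, k=7):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8 + 1)

    def _positions(self, item: str):
        # Derive k positions by prefixing the item with the hash index
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False -> definitely absent; True -> possibly present
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

bf = BloomFilter()
bf.add("alice")
print(bf.might_contain("alice"), bf.might_contain("mallory"))
```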

More applications

  • HMAC — authenticate messages with a shared secret
  • DHTs — Chord, Kademlia (BitTorrent, IPFS)
  • Load balancing — hash request key to pick backend server
  • Data partitioning — shard databases by hash of partition key
21

Summary & Further Reading

Key takeaways

  • Hash functions must be deterministic, efficient, and distribute keys uniformly
  • Division method requires careful choice of m; multiplication method is more robust
  • Chaining tolerates high load factors; open addressing is cache-friendly but load-sensitive
  • Double hashing eliminates clustering; cuckoo hashing guarantees worst-case O(1) lookup
  • Robin Hood hashing bounds probe-distance variance for predictable performance
  • Consistent hashing with vnodes is the backbone of distributed storage
  • Cryptographic hashes add pre-image and collision resistance — essential for security
  • Perfect hashing achieves zero collisions for static key sets
  • LSH inverts the hash contract — similar items hash together for ANN search
  • Real-world hash tables combine open addressing with SIMD and careful engineering

Recommended reading

Source          Description
CLRS            Introduction to Algorithms — Ch. 11, the canonical hash tables reference
Knuth           TAOCP Vol. 3 — Sorting and Searching, Section 6.4
Karger et al.   "Consistent Hashing and Random Trees" (1997)
Mitzenmacher    Probability and Computing — hashing and Bloom filter analysis
Swiss Tables    CppCon 2017 — Matt Kulukundis on Abseil flat_hash_map