Exact data structures (hash tables, balanced trees, sorted arrays) guarantee correct answers, but their space and time costs grow linearly — or worse — with data size.
Key properties: one-sided errors · sub-linear space · tuneable accuracy
Invented by Burton Bloom (1970). Answers one question: "Is element x in the set?"
The element might be in the set — false positive possible
The element is certainly not in the set — no false negatives
A Bloom filter with 10 bits per element and 7 hash functions achieves a false positive rate of ~0.8%.
| Component | Description |
|---|---|
| Bit array | m bits, all initialised to 0 |
| Hash functions | k independent hash functions, each mapping an element to a position in [0, m) |
Think of it as k different hash tables compressed into a single bit array. Each hash function "votes" on whether the element is present. If all k votes say yes, we report membership — but collisions can cause false positives.
For each hash function h_i (i = 1..k): compute h_i(x) mod m, set that bit to 1.
For each h_i: if any bit is 0 → definitely not in set. If all bits are 1 → probably in set.
No deletion: clearing a bit might affect other elements that hash to the same position.
m = 16 bits, k = 3 hash functions
Insert "cat": h1=2, h2=7, h3=13
[0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0]
Insert "dog": h1=4, h2=7, h3=11
[0 0 1 0 1 0 0 1 0 0 0 1 0 1 0 0]
Query "cat": 2,7,13 all=1 → probably yes ✓
Query "fish": h1=2,h2=4,h3=9 → bit 9=0 → no ✓
Query "bird": h1=2,h2=7,h3=11 → all 1
→ FALSE POSITIVE ✗
After inserting n elements into an m-bit array with k hash functions:
(1 - 1/m)kne-kn/m(1 - e-kn/m)kFPP ≈ (1 - e^(-kn/m))^k
Doubling the bits per element roughly squares the false positive rate (cuts it dramatically). Each additional ~1.44 bits per element halves the FPP at optimal k.
| Bits/elem (m/n) | k (optimal) | FPP |
|---|---|---|
| 4 | 3 | ~4.7% |
| 8 | 6 | ~2.2% |
| 10 | 7 | ~0.82% |
| 16 | 11 | ~0.046% |
| 24 | 17 | ~0.0015% |
k_opt = (m/n) × ln(2) ≈ 0.693 × (m/n)
FPP_min = (1/2)^k ≈ (0.6185)^(m/n)
Given desired false positive rate p and expected elements n:
m = -n × ln(p) / (ln(2))²
k = (m/n) × ln(2)
1 million elements, 1% false positive rate:
m = 9,585,059 bits (~1.14 MB)k = 7 hash functionsUse two hash functions and derive k values via h_i(x) = h1(x) + i × h2(x). Performance matches k independent functions.
Standard Bloom filters do not support deletion — clearing a bit might remove other elements.
Use a small counter per position (typically 4 bits):
h_i(x)h_i(x)| Aspect | Standard | Counting |
|---|---|---|
| Space | 1 bit/slot | 4 bits/slot (4x) |
| Deletion | Not supported | Supported |
| Overflow | N/A | Rare (~10-15) |
| False neg | Impossible | Possible on overflow |
Used in network routing tables and distributed caches where elements must be removed.
Stores compact fingerprints in a cuckoo hash table. Named after cuckoo hashing, where elements can be displaced to alternative locations.
b1 = hash(x), b2 = b1 XOR hash(fp(x))| Feature | Bloom | Cuckoo |
|---|---|---|
| Deletion | No | Yes |
| Space (FPP <3%) | Good | Better |
| Lookup | k accesses | 2 accesses |
| Max load | ~50% | ~95% |
Cuckoo filters are the modern default for membership queries when deletion is needed or FPP target is below ~3%.
Hash each element to a p-bit fingerprint. Split into q-bit quotient (bucket index) and r-bit remainder (stored value). p = q + r.
| Bit | Purpose |
|---|---|
is_occupied | A canonical slot for some element's quotient |
is_continuation | Part of a run (same quotient) |
is_shifted | Stored away from its canonical slot |
Quotient filters shine in on-disk scenarios (LSM trees, SSTables) because they are cache-line-aligned and support efficient merging during compaction.
Count distinct elements in a massive stream. Exact counting requires O(n) space. HyperLogLog: ~12 KB for billions of elements.
Hash elements uniformly. The maximum number of leading zeros in any hash tells you roughly log2(n_distinct).
k → estimate 2k distinct elements| System | Use |
|---|---|
| Redis | PFADD / PFCOUNT |
| BigQuery | APPROX_COUNT_DISTINCT() |
| Presto/Trino | approx_distinct() |
| Spark | approx_count_distinct() |
| PostgreSQL | hll extension |
HyperLogLog is arguably the most widely deployed probabilistic data structure in production systems today.
b bits to select one of m = 2b registersE = αm · m² · (Σ 2-R[j])-1| Registers (m) | Memory | Std Error |
|---|---|---|
| 64 | 192 B | ~13% |
| 256 | 768 B | ~6.5% |
| 1024 | 3 KB | ~3.25% |
| 4096 | 12 KB | ~1.6% |
| 16384 | 48 KB | ~0.8% |
Registers can be merged by taking element-wise maximums — perfect for distributed counting across shards.
2D array of counters: d rows (hash functions) × w columns (width). d independent hash functions, one per row.
i, increment counters[i][h_i(x)] by cmin(counters[i][h_i(x)]) across all rows1 - δ: error ≤ ε · ||stream||w = ⌈e/ε⌉, d = ⌈ln(1/δ)⌉d=3 rows, w=8 columns
Update("cat", 5):
row 0 col 3 += 5
row 1 col 7 += 5
row 2 col 1 += 5
Update("dog", 3):
row 0 col 5 += 3
row 1 col 3 += 3
row 2 col 1 += 3
Query("cat"):
min(col3=5, col7=5, col1=8) = 5 ✓
Query("dog"):
min(col5=3, col3=3, col1=8) = 3 ✓
Used for heavy-hitter detection, traffic monitoring, and frequency capping in ad tech.
Estimate similarity of two sets without full comparison. Jaccard similarity: J(A,B) = |A∩B| / |A∪B|
k independent hash functionsPr[min(h(A)) = min(h(B))] = J(A, B)
The probability that minimum hash values agree exactly equals the Jaccard similarity. Averaging over k hash functions reduces variance.
Combine MinHash signatures into bands for fast approximate nearest-neighbour search:
b bands of r rowsb and r to control the S-curve thresholdA probabilistic alternative to balanced trees. Randomised multi-level linked lists achieve O(log n) expected search, insert, and delete — without rotations.
p (typically 1/2)| Operation | Expected | Worst |
|---|---|---|
| Search | O(log n) | O(n) |
| Insert | O(log n) | O(n) |
| Delete | O(log n) | O(n) |
| Space | O(n) | O(n log n) |
Balance depends on coin flips during insertion, not deterministic rebalancing. In expectation, levels and search path length are both O(log n).
Skip lists as the primary index structure
Memtable implemented as a skip list
Concurrent skip lists for in-memory posting lists
Select k items uniformly at random from a stream of unknown length n, using only O(k) memory.
reservoir = first k items from stream
for i = k+1 to n:
j = random integer in [1, i]
if j <= k:
reservoir[j] = stream[i]
Each element ends up in the reservoir with probability k/n.
| Variant | Use Case |
|---|---|
| Weighted | Non-uniform probability proportional to weights |
| Distributed | Merge reservoirs from parallel streams |
| Algorithm L | Optimal: skips geometrically many elements |
| Structure | Space | Error Type | Delete? | Merge? | Primary Op |
|---|---|---|---|---|---|
| Bloom Filter | O(n) bits | False positive | No | OR | Membership |
| Counting Bloom | O(n) counters | False positive | Yes | Add | Membership |
| Cuckoo Filter | O(n) fps | False positive | Yes | No | Membership |
| Quotient Filter | O(n) rems | False positive | Variant | Yes | Membership |
| HyperLogLog | ~12 KB | ±~1.6% | N/A | Max | Cardinality |
| Count-Min Sketch | O(1/ε) | Over-estimate | N/A | Add | Frequency |
| MinHash | O(k)/set | ± similarity | N/A | Union | Similarity |
| Skip List | O(n) | None (exact) | Yes | No | Ordered ops |
| Reservoir | O(k) | Sampling var | N/A | Merge | Random sample |
Membership filters use bits, not pointers — orders of magnitude smaller than exact sets.
APPROX_COUNT_DISTINCTPFCOUNT)| Scenario | Recommended |
|---|---|
| Membership, no deletion | Bloom filter |
| Membership with deletion | Cuckoo filter |
| On-disk membership | Quotient filter |
| Distinct count | HyperLogLog |
| Frequency estimation | Count-Min Sketch |
| Set similarity at scale | MinHash + LSH |
| Ordered, concurrent | Skip list |
| Random sample from stream | Reservoir sampling |
Leads to unacceptable FPP as the filter fills
Poor hash functions create correlated bits, inflating FPP
Standard Bloom filters cannot be resized; use scalable variants
Propagate uncertainty through your system
| Source | Description |
|---|---|
| Bloom (1970) | "Space/Time Trade-offs in Hash Coding with Allowable Errors" |
| Flajolet et al. | "HyperLogLog: near-optimal cardinality estimation" |
| Fan et al. (2014) | "Cuckoo Filter: Practically Better Than Bloom" |
| Cormode & Muthukrishnan | "Count-Min Sketch and its Applications" |
| Mitzenmacher & Upfal | Probability and Computing |
| Redis Docs | redis.io/docs — HLL, Bloom, skip lists |