COMPUTER SCIENCE FUNDAMENTALS SERIES

Bloom Filters and
Probabilistic Data Structures

Bloom filters · HyperLogLog · Count-Min Sketch · Cuckoo filters ·
False positives · Cardinality estimation
Mid-level software engineer track  ·  20 slides
02

Why Probabilistic Data Structures?

The exactness tax

Exact data structures (hash tables, balanced trees, sorted arrays) guarantee correct answers, but their space and time costs grow linearly — or worse — with data size.

  • Storing 1 billion 64-bit keys in a hash set costs ~8 GB of RAM minimum
  • Exact distinct count over a stream requires storing every unique element
  • Exact set intersection over two billion-element sets requires materialising both

The probabilistic bargain

1.2 GB Bloom filter — membership over 1 billion keys vs ~8 GB exact
12 KB HyperLogLog — count distinct over billions of elements
Fixed-size Count-Min Sketch — frequency estimation regardless of stream size

Key properties: one-sided errors · sub-linear space · tuneable accuracy

03

Bloom Filter — Core Idea

Invented by Burton Bloom (1970). Answers one question: "Is element x in the set?"

✓ "Probably yes"

The element might be in the set — false positive possible

✗ "Definitely no"

The element is certainly not in the set — no false negatives

A Bloom filter with 10 bits per element and 7 hash functions achieves a false positive rate of ~0.8%.

Components

Component        Description
Bit array        m bits, all initialised to 0
Hash functions   k independent hash functions, each mapping an element to a position in [0, m)

Think of it as k different hash tables compressed into a single bit array. Each hash function "votes" on whether the element is present. If all k votes say yes, we report membership — but collisions can cause false positives.

04

Bloom Filter — Insert & Query

Insert(x)

For each hash function h_i (i = 1..k): compute h_i(x) mod m, set that bit to 1.

Query(x)

For each h_i: if any bit is 0 → definitely not in set. If all bits are 1 → probably in set.

No deletion: clearing a bit might affect other elements that hash to the same position.
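As a concrete illustration, here is a minimal Python sketch of both operations. The class name and the sha256-based double-hashing scheme are illustrative choices, not part of the original construction:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: m bits, k positions per element."""
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray((m + 7) // 8)

    def _positions(self, item):
        # Derive k positions from two base hashes (double hashing).
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] >> (pos % 8) & 1
                   for pos in self._positions(item))

bf = BloomFilter(m=1024, k=7)
bf.add("cat")
bf.add("dog")
print("cat" in bf)   # True ("probably yes")
print("fish" in bf)  # almost surely False, barring a rare false positive
```

Note that membership is exposed via `__contains__`, so the `in` operator gives the "probably yes / definitely no" answer directly.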

Walk-through

m = 16 bits, k = 3 hash functions

Insert "cat": h1=2, h2=7, h3=13
  [0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0]

Insert "dog": h1=4, h2=7, h3=11
  [0 0 1 0 1 0 0 1 0 0 0 1 0 1 0 0]

Query "cat":  2,7,13 all=1 → probably yes ✓
Query "fish": h1=2,h2=4,h3=9 → bit 9=0 → no ✓
Query "bird": h1=2,h2=7,h3=11 → all 1
              → FALSE POSITIVE ✗
05

False Positive Probability Analysis

Probability derivation

After inserting n elements into an m-bit array with k hash functions:

  • Probability a specific bit is still 0: (1 − 1/m)^(kn)
  • Approximation: e^(−kn/m)
  • False positive probability: (1 − e^(−kn/m))^k

FPP ≈ (1 - e^(-kn/m))^k

Doubling the bits per element (at optimal k) roughly squares the false positive rate; since the rate is below 1, squaring shrinks it dramatically. Equivalently, each additional ~1.44 bits per element halves the FPP.

Practical numbers

Bits/elem (m/n)   k (optimal)   FPP
4                 3             ~14.7%
8                 6             ~2.2%
10                7             ~0.82%
16                11            ~0.046%
24                17            ~0.001%
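The approximation can be evaluated directly; a quick sketch (function name illustrative):

```python
import math

def fpp(bits_per_elem, k):
    """Approximate false positive probability (1 - e^(-kn/m))^k, given m/n bits per element."""
    return (1 - math.exp(-k / bits_per_elem)) ** k

for bits, k in [(4, 3), (8, 6), (10, 7), (16, 11), (24, 17)]:
    print(f"m/n = {bits:2d}, k = {k:2d}: FPP ≈ {fpp(bits, k):.4%}")
```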
06

Optimal Number of Hash Functions

The trade-off

  • Too few hash functions — high FPP because fewer bits checked
  • Too many hash functions — bit array fills up faster, increasing FPP
  • Optimal k balances these two effects

Formulas

k_opt = (m/n) × ln(2) ≈ 0.693 × (m/n)

FPP_min = (1/2)^k ≈ (0.6185)^(m/n)

Practical sizing

Given desired false positive rate p and expected elements n:

m = -n × ln(p) / (ln(2))²
k = (m/n) × ln(2)

Example

1 million elements, 1% false positive rate:

  • m = 9,585,059 bits (~1.14 MB)
  • k = 7 hash functions
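The sizing formulas translate into a few lines of Python (helper name is illustrative):

```python
import math

def bloom_size(n, p):
    """Bits m and hash count k for n expected elements at target FPP p."""
    m = math.ceil(-n * math.log(p) / math.log(2) ** 2)
    k = round(m / n * math.log(2))
    return m, k

m, k = bloom_size(1_000_000, 0.01)
print(m, k)  # 9585059 7  → ~1.14 MiB of bits and 7 hash functions
```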

Kirsch-Mitzenmacher optimisation

Use two hash functions and derive the k positions via h_i(x) = (h1(x) + i × h2(x)) mod m. The false positive rate asymptotically matches k independent functions.

07

Counting Bloom Filters

Problem

Standard Bloom filters do not support deletion — clearing a bit might remove other elements.

Solution: replace bits with counters

Use a small counter per position (typically 4 bits):

  • Insert(x) — increment each counter at h_i(x)
  • Delete(x) — decrement each counter at h_i(x)
  • Query(x) — all counters > 0? → probably in set
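A minimal sketch of the counter-based variant (class name illustrative; a plain Python list stands in for the packed 4-bit counters a real implementation would use):

```python
import hashlib

class CountingBloom:
    """Counting Bloom filter sketch: per-position counters enable deletion."""
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.counters = [0] * m  # real implementations pack 4-bit counters

    def _positions(self, item):
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.counters[pos] += 1

    def remove(self, item):
        # Caller must ensure the item was actually added,
        # or counters go negative and queries break.
        for pos in self._positions(item):
            self.counters[pos] -= 1

    def __contains__(self, item):
        return all(self.counters[pos] > 0 for pos in self._positions(item))

cb = CountingBloom(m=1024, k=7)
cb.add("cat")
cb.add("dog")
cb.remove("dog")
print("cat" in cb, "dog" in cb)  # True False (barring a rare collision)
```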

Trade-offs

Aspect      Standard        Counting
Space       1 bit/slot      4 bits/slot (4×)
Deletion    Not supported   Supported
Overflow    N/A             Rare (~10⁻¹⁵ per counter)
False neg   Impossible      Possible on overflow

Used in network routing tables and distributed caches where elements must be removed.

08

Cuckoo Filters

A cuckoo filter stores compact fingerprints in a cuckoo hash table. It is named after cuckoo hashing, where inserted elements can displace existing ones to their alternative locations.

Structure

  • Array of buckets, each holding a small number of slots (typically 4)
  • Each element → a fingerprint (8–12 bits)
  • Two candidate buckets: b1 = hash(x), b2 = b1 XOR hash(fp(x))

Operations

  • Insert — place fingerprint in b1 or b2; if both full, evict (cuckoo displacement)
  • Lookup — check fingerprint in b1 or b2
  • Delete — remove matching fingerprint from b1 or b2
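The three operations can be sketched as follows. This is a simplified illustration (class name, 8-bit fingerprints, sha256-derived hashing, and the power-of-two bucket count are all assumptions of this sketch, chosen so the XOR trick is an involution):

```python
import hashlib, random

def _h(data: bytes) -> int:
    """64-bit hash (sha256-derived; any good hash works)."""
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

class CuckooFilter:
    """Cuckoo filter sketch: 4-slot buckets, 8-bit fingerprints, partial-key hashing."""
    def __init__(self, num_buckets=1024, max_kicks=500):
        self.n = num_buckets          # power of two, so the XOR trick stays valid
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]

    def _fingerprint(self, item):
        fp = _h(item.encode()) & 0xFF
        return fp or 1                # reserve 0 so fingerprints are never empty

    def _candidates(self, item):
        fp = self._fingerprint(item)
        b1 = _h(item.encode()) % self.n
        b2 = (b1 ^ _h(bytes([fp]))) % self.n  # and b1 == (b2 ^ hash(fp)) % n
        return fp, b1, b2

    def insert(self, item):
        fp, b1, b2 = self._candidates(item)
        for b in (b1, b2):
            if len(self.buckets[b]) < 4:
                self.buckets[b].append(fp)
                return True
        b = random.choice((b1, b2))   # both full: start cuckoo displacement
        for _ in range(self.max_kicks):
            i = random.randrange(len(self.buckets[b]))
            fp, self.buckets[b][i] = self.buckets[b][i], fp  # evict a resident
            b = (b ^ _h(bytes([fp]))) % self.n               # its other bucket
            if len(self.buckets[b]) < 4:
                self.buckets[b].append(fp)
                return True
        return False                  # filter is effectively full

    def __contains__(self, item):
        fp, b1, b2 = self._candidates(item)
        return fp in self.buckets[b1] or fp in self.buckets[b2]

    def delete(self, item):
        fp, b1, b2 = self._candidates(item)
        for b in (b1, b2):
            if fp in self.buckets[b]:
                self.buckets[b].remove(fp)
                return True
        return False

cf = CuckooFilter()
cf.insert("cat")
cf.insert("dog")
cf.delete("dog")
print("cat" in cf, "dog" in cf)  # True False (barring a rare fingerprint collision)
```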

vs Bloom filters

Feature            Bloom        Cuckoo
Deletion           No           Yes
Space (FPP < 3%)   Good         Better
Lookup             k accesses   2 accesses
Max load           ~50%         ~95%

Cuckoo filters are the modern default for membership queries when deletion is needed or FPP target is below ~3%.

09

Quotient Filters

Cache-friendly alternative

Hash each element to a p-bit fingerprint. Split into q-bit quotient (bucket index) and r-bit remainder (stored value). p = q + r.

Key properties

  • Cache-friendly — contiguous array, no pointer chasing
  • Mergeable — two filters merged in linear time
  • Resizable — can be doubled by redistributing remainders
  • Deletion — supported with counting variants

Metadata bits per slot

Bit               Purpose
is_occupied       A canonical slot for some element's quotient
is_continuation   Part of a run (same quotient)
is_shifted        Stored away from its canonical slot

Quotient filters shine in on-disk scenarios (LSM trees, SSTables) because they are cache-line-aligned and support efficient merging during compaction.

10

HyperLogLog — Cardinality Estimation

Problem

Count distinct elements in a massive stream. Exact counting requires O(n) space. HyperLogLog: ~12 KB for billions of elements.

Core insight

Hash elements uniformly. The maximum number of leading zeros in any hash tells you roughly log2(n_distinct).

  • Longest run of leading zeros = k → estimate ~2^k distinct elements
  • Single observation is noisy — HyperLogLog uses many independent observations

Real-world usage

System         Use
Redis          PFADD / PFCOUNT
BigQuery       APPROX_COUNT_DISTINCT()
Presto/Trino   approx_distinct()
Spark          approx_count_distinct()
PostgreSQL     hll extension

HyperLogLog is arguably the most widely deployed probabilistic data structure in production systems today.

11

HyperLogLog — How It Works

Algorithm

  1. Hash each element to a uniform binary string
  2. Use the first b bits to select one of m = 2^b registers
  3. Count leading zeros in the remaining bits; store the maximum per register
  4. Estimate: E = α_m · m² · (Σ_j 2^(−R[j]))^(−1)
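The four steps map directly to code. A bare-bones sketch (class name and sha256-derived 64-bit hash are assumptions; the small- and large-range corrections described below are deliberately omitted):

```python
import hashlib

def _hash64(item: str) -> int:
    return int.from_bytes(hashlib.sha256(item.encode()).digest()[:8], "big")

class HyperLogLog:
    """Raw HLL estimator with m = 2^b registers; range corrections omitted."""
    def __init__(self, b=10):
        self.b, self.m = b, 1 << b
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)  # bias constant for m >= 128

    def add(self, item):
        x = _hash64(item)
        j = x >> (64 - self.b)                 # first b bits pick a register
        rest = x & ((1 << (64 - self.b)) - 1)  # remaining 64 - b bits
        rho = (64 - self.b) - rest.bit_length() + 1  # leading zeros + 1
        self.registers[j] = max(self.registers[j], rho)

    def estimate(self):
        z = sum(2.0 ** -r for r in self.registers)
        return self.alpha * self.m * self.m / z

hll = HyperLogLog(b=10)   # 1024 registers → ~3.25% standard error
for i in range(100_000):
    hll.add(f"user-{i}")
print(round(hll.estimate()))  # close to 100000
```

Merging two such sketches is just an element-wise max over their register arrays.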

Corrections

  • Small range — many registers empty → use linear counting
  • Large range — estimate exceeds 2^32/30 → hash collision correction
  • HLL++ (Google, 2013) — improved bias correction, sparse representation

Space and accuracy

Registers (m)   Memory   Std Error
64              192 B    ~13%
256             768 B    ~6.5%
1024            3 KB     ~3.25%
4096            12 KB    ~1.6%
16384           48 KB    ~0.8%

Mergeability

Registers can be merged by taking element-wise maximums — perfect for distributed counting across shards.

12

Count-Min Sketch

Structure

2D array of counters: d rows (hash functions) × w columns (width). d independent hash functions, one per row.

Operations

  • Update(x, c) — for each row i, increment counters[i][h_i(x)] by c
  • Query(x) — return min(counters[i][h_i(x)]) across all rows

Error bounds

  • Estimate is always ≥ true count (over-estimates only)
  • With probability 1 − δ: error ≤ ε · ‖stream‖₁ (the total count across all elements)
  • Space: w = ⌈e/ε⌉, d = ⌈ln(1/δ)⌉
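A compact sketch of the structure, sized from ε and δ as above. The class name is illustrative, and deriving the per-row hashes from disjoint slices of one sha256 digest is a simplification of the d independent hash functions; the columns will therefore differ from the illustrative walk-through below, but the counts behave the same way:

```python
import hashlib, math

class CountMinSketch:
    """Count-Min Sketch sized from error ε and failure probability δ."""
    def __init__(self, eps=0.01, delta=0.01):
        self.w = math.ceil(math.e / eps)         # width: ⌈e/ε⌉ columns
        self.d = math.ceil(math.log(1 / delta))  # depth: ⌈ln(1/δ)⌉ rows
        self.table = [[0] * self.w for _ in range(self.d)]

    def _cols(self, item):
        # One column per row, derived from disjoint 4-byte slices of one digest.
        digest = hashlib.sha256(item.encode()).digest()
        return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.w
                for i in range(self.d)]

    def update(self, item, count=1):
        for row, col in enumerate(self._cols(item)):
            self.table[row][col] += count

    def query(self, item):
        # Minimum across rows: collisions only inflate counts, never deflate.
        return min(self.table[row][col]
                   for row, col in enumerate(self._cols(item)))

cms = CountMinSketch(eps=0.01, delta=0.01)
cms.update("cat", 5)
cms.update("dog", 3)
print(cms.query("cat"), cms.query("dog"))  # 5 3 (estimates never undercount)
```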

Walk-through

d=3 rows, w=8 columns

Update("cat", 5):
  row 0 col 3 += 5
  row 1 col 7 += 5
  row 2 col 1 += 5

Update("dog", 3):
  row 0 col 5 += 3
  row 1 col 3 += 3
  row 2 col 1 += 3

Query("cat"):
  min(col3=5, col7=5, col1=8) = 5 ✓

Query("dog"):
  min(col5=3, col3=3, col1=8) = 3 ✓

Used for heavy-hitter detection, traffic monitoring, and frequency capping in ad tech.

13

MinHash — Jaccard Similarity

Problem

Estimate similarity of two sets without full comparison. Jaccard similarity: J(A,B) = |A∩B| / |A∪B|

Algorithm

  1. Choose k independent hash functions
  2. For each set, compute the minimum hash value under each function
  3. Fraction of matching minimums estimates Jaccard similarity
Pr[min(h(A)) = min(h(B))] = J(A, B)

The probability that minimum hash values agree exactly equals the Jaccard similarity. Averaging over k hash functions reduces variance.
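The estimator is short enough to show end to end. Salting one base hash per "hash function" is a common implementation shortcut assumed here, and the function names are illustrative:

```python
import hashlib

def minhash_signature(items, k=128):
    """k salted hash functions; keep the minimum hash value of each over the set."""
    sig = []
    for i in range(k):
        salt = i.to_bytes(4, "big")
        sig.append(min(
            int.from_bytes(hashlib.sha256(salt + w.encode()).digest()[:8], "big")
            for w in items))
    return sig

def estimate_jaccard(sig_a, sig_b):
    # Fraction of positions where the two signatures agree.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = {f"item-{i}" for i in range(100)}
b = {f"item-{i}" for i in range(50, 150)}  # true Jaccard = 50/150 ≈ 0.33
est = estimate_jaccard(minhash_signature(a), minhash_signature(b))
print(f"estimated Jaccard ≈ {est:.2f}")
```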

Locality-Sensitive Hashing (LSH)

Combine MinHash signatures into bands for fast approximate nearest-neighbour search:

  • Divide signature into b bands of r rows
  • Items are candidates if they agree in all rows of at least one band
  • Tune b and r to control the S-curve threshold

Applications

near-duplicate detection · recommendation · plagiarism detection · code clone search
14

Skip Lists

A probabilistic alternative to balanced trees. Randomised multi-level linked lists achieve O(log n) expected search, insert, and delete — without rotations.

Structure

  • Level 0 — sorted linked list of all elements
  • Level 1+ — each element promoted with probability p (typically 1/2)
  • Search starts at the top level, drops down when overshooting

Expected complexity

Operation   Expected   Worst
Search      O(log n)   O(n)
Insert      O(log n)   O(n)
Delete      O(log n)   O(n)
Space       O(n)       O(n log n)

Why probabilistic?

Balance depends on coin flips during insertion, not deterministic rebalancing. In expectation, levels and search path length are both O(log n).
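The coin-flip promotion is easy to see in isolation. A small demonstration (function name illustrative) of how tower heights distribute:

```python
import random

def random_level(p=0.5, max_level=32):
    """Flip coins: a node reaches level i with probability p^i."""
    level = 0
    while level < max_level and random.random() < p:
        level += 1
    return level

random.seed(1)
levels = [random_level() for _ in range(100_000)]
# Roughly half the nodes stay at level 0, a quarter reach level 1, and so on,
# so the tallest tower among n nodes is about log2(n) levels; the average
# tower height is ~1 extra pointer per node, keeping space O(n).
print(max(levels), sum(levels) / len(levels))
```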

Real-world usage

Redis sorted sets (ZSET)

Skip lists as the primary index structure

LevelDB / RocksDB

Memtable implemented as a skip list

Lucene

Concurrent skip lists for in-memory posting lists

15

Reservoir Sampling

Problem

Select k items uniformly at random from a stream of unknown length n, using only O(k) memory.

Algorithm (Vitter's Algorithm R)

reservoir = first k items from stream

for i = k+1 to n:
    j = random integer in [1, i]
    if j <= k:
        reservoir[j] = stream[i]

Each element ends up in the reservoir with probability k/n.
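The pseudocode above translates directly to Python (0-indexed, working over any iterable stream):

```python
import random

def reservoir_sample(stream, k):
    """Vitter's Algorithm R: uniform k-sample from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)        # fill the reservoir first
        else:
            j = random.randrange(i + 1)   # uniform in [0, i]
            if j < k:
                reservoir[j] = item       # replace with probability k/(i+1)
    return reservoir

random.seed(0)
sample = reservoir_sample(range(1_000_000), k=10)
print(sample)  # 10 uniformly chosen distinct values from the million-item stream
```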

Variants

Variant       Use Case
Weighted      Non-uniform probability proportional to weights
Distributed   Merge reservoirs from parallel streams
Algorithm L   Optimal: skips geometrically many elements

Applications

  • Sampling log lines for analysis without storing the full stream
  • A/B test random assignment from an unbounded user stream
  • Random row selection from a database table scan
16

Comparison Table

Probabilistic data structures at a glance

Structure          Space               Error Type       Delete?   Merge?   Primary Op
Bloom Filter       O(n) bits           False positive   No        OR       Membership
Counting Bloom     O(n) counters       False positive   Yes       Add      Membership
Cuckoo Filter      O(n) fingerprints   False positive   Yes       No       Membership
Quotient Filter    O(n) remainders     False positive   Variant   Yes      Membership
HyperLogLog        ~12 KB              ±~1.6%           N/A       Max      Cardinality
Count-Min Sketch   O(1/ε)              Over-estimate    N/A       Add      Frequency
MinHash            O(k) per set        ± similarity     N/A       Union    Similarity
Skip List          O(n)                None (exact)     Yes       No       Ordered ops
Reservoir          O(k)                Sampling var.    N/A       Merge    Random sample

Membership filters use bits, not pointers — orders of magnitude smaller than exact sets.

17

Applications — Databases & Caches

Databases

  • LSM-tree SSTables — Bloom filters avoid unnecessary disk reads. LevelDB, RocksDB, Cassandra, HBase.
  • PostgreSQL — Bloom indexes for multi-column equality filters
  • BigQuery / Presto — HLL for APPROX_COUNT_DISTINCT
  • Query planners — Count-Min Sketch for join selectivity estimation

Caches

  • CDN edges — Bloom filter gates origin fetch; avoids cache pollution
  • Chrome Safe Browsing — Bloom filter checks URLs vs malware list locally
  • Squid proxy — cache digests are Bloom filters for cooperative caching

Distributed systems

  • Cassandra — Bloom filter per SSTable for key lookups
  • Bitcoin SPV — Bloom filters for lightweight transaction filtering
  • HDFS — HyperLogLog for file-system analytics across petabytes
18

Applications — Networks & Analytics

Network routers & security

  • Packet dedup — Bloom filters detect duplicates at line rate
  • DDoS mitigation — Count-Min Sketch identifies heavy-hitter IPs
  • Firewall ACLs — Bloom filter for fast IP/port membership

Spell checkers & NLP

  • Dictionaries — Bloom filter stores word list without full dictionary in memory
  • Web crawlers — visited-URL tracking via Bloom filter (Google, Bing)
  • Plagiarism — MinHash over word n-grams for near-duplicate detection

Analytics & streaming

  • Unique visitors — HyperLogLog for real-time uniques (Redis PFCOUNT)
  • Ad frequency capping — Count-Min Sketch limits ad impressions per user
  • Dashboards — reservoir sampling for percentile estimation
  • Stream joins — Bloom filter pre-filters one side, reducing waste
19

Design Trade-offs

Choosing the right structure

Scenario                    Recommended
Membership, no deletion     Bloom filter
Membership with deletion    Cuckoo filter
On-disk membership          Quotient filter
Distinct count              HyperLogLog
Frequency estimation        Count-Min Sketch
Set similarity at scale     MinHash + LSH
Ordered, concurrent         Skip list
Random sample from stream   Reservoir sampling

Common pitfalls

Under-sizing Bloom filters

Leads to unacceptable FPP as the filter fills

Ignoring hash quality

Poor hash functions create correlated bits, inflating FPP

Forgetting about growth

Standard Bloom filters cannot be resized; use scalable variants

Treating estimates as exact

Propagate uncertainty through your system

20

Summary & Further Reading

Key takeaways

  • Probabilistic structures trade bounded, tuneable error for dramatic space and time savings
  • Bloom filters remain the workhorse for membership queries — simple, fast, well-analysed
  • HyperLogLog is the standard for cardinality estimation in databases and streaming
  • Count-Min Sketch is the go-to for frequency estimation in streaming and network monitoring
  • Cuckoo filters improve on Bloom filters when deletion is needed or FPP < 3%
  • MinHash enables similarity estimation at scale for deduplication and recommendation
  • Skip lists prove randomisation can replace complex balancing logic
  • Always size structures based on expected volume and acceptable error rate

Recommended reading

Source                    Description
Bloom (1970)              "Space/Time Trade-offs in Hash Coding with Allowable Errors"
Flajolet et al. (2007)    "HyperLogLog: The Analysis of a Near-Optimal Cardinality Estimation Algorithm"
Fan et al. (2014)         "Cuckoo Filter: Practically Better Than Bloom"
Cormode & Muthukrishnan   "An Improved Data Stream Summary: The Count-Min Sketch and Its Applications"
Mitzenmacher & Upfal      Probability and Computing
Redis Docs                redis.io/docs — HLL, Bloom, skip lists