COMPUTER SCIENCE FUNDAMENTALS SERIES

Bloom Filters and
Probabilistic Data Structures

Bloom filters · HyperLogLog · Count-Min Sketch · Cuckoo filters ·
False positives · Cardinality estimation
Mid-level software engineer track  ·  20 slides
02

Why Probabilistic Data Structures?

The exactness tax

Exact data structures (hash tables, balanced trees, sorted arrays) guarantee correct answers, but their space and time costs grow linearly — or worse — with data size.

  • Storing 1 billion 64-bit keys in a hash set costs ~8 GB of RAM minimum
  • Exact distinct count over a stream requires storing every unique element
  • Exact set intersection over two billion-element sets requires materialising both

The probabilistic bargain

1.2 GB Bloom filter — membership over 1 billion keys vs ~8 GB exact
12 KB HyperLogLog — count distinct over billions of elements
Fixed-size Count-Min Sketch — frequency estimation regardless of stream size

Key properties: one-sided errors · sub-linear space · tuneable accuracy

03

Bloom Filter — Core Idea

Invented by Burton Bloom (1970). Answers one question: "Is element x in the set?"

✓ "Probably yes"

The element might be in the set — false positive possible

✗ "Definitely no"

The element is certainly not in the set — no false negatives

A Bloom filter with 10 bits per element and 7 hash functions achieves a false positive rate of ~0.8%.

Components

Component        Description
Bit array        m bits, all initialised to 0
Hash functions   k independent hash functions, each mapping an element to a position in [0, m)

Think of it as k different hash tables compressed into a single bit array. Each hash function "votes" on whether the element is present. If all k votes say yes, we report membership — but collisions can cause false positives.

04

Bloom Filter — Insert & Query

Insert(x)

For each hash function h_i (i = 1..k): compute h_i(x) mod m, set that bit to 1.

Query(x)

For each h_i: if any bit is 0 → definitely not in set. If all bits are 1 → probably in set.

No deletion: clearing a bit might affect other elements that hash to the same position.
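As a concrete illustration, here is a minimal Python sketch of both operations. The class name and the sha256-based double-hashing scheme are illustrative choices, not part of the original construction:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: m bits, k positions per element."""
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray((m + 7) // 8)

    def _positions(self, item):
        # Derive k positions from two base hashes (double hashing).
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] >> (pos % 8) & 1
                   for pos in self._positions(item))

bf = BloomFilter(m=1024, k=7)
bf.add("cat")
bf.add("dog")
print("cat" in bf)   # True ("probably yes")
print("fish" in bf)  # almost surely False, barring a rare false positive
```

Note that membership is exposed via `__contains__`, so the `in` operator gives the "probably yes / definitely no" answer directly.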

Walk-through

m = 16 bits, k = 3 hash functions

Insert "cat": h1=2, h2=7, h3=13
  [0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0]

Insert "dog": h1=4, h2=7, h3=11
  [0 0 1 0 1 0 0 1 0 0 0 1 0 1 0 0]

Query "cat":  2,7,13 all=1 → probably yes ✓
Query "fish": h1=2,h2=4,h3=9 → bit 9=0 → no ✓
Query "bird": h1=2,h2=7,h3=11 → all 1
              → FALSE POSITIVE ✗
05

False Positive Probability Analysis

Probability derivation

After inserting n elements into an m-bit array with k hash functions:

  • Probability a specific bit is still 0: (1 − 1/m)^(kn)
  • Approximation: e^(−kn/m)
  • False positive probability: (1 − e^(−kn/m))^k

FPP ≈ (1 - e^(-kn/m))^k

Doubling the bits per element (at optimal k) roughly squares the false positive rate; since the rate is below 1, squaring shrinks it dramatically. Equivalently, each additional ~1.44 bits per element halves the FPP.

Practical numbers

Bits/elem (m/n)   k (optimal)   FPP
4                 3             ~14.7%
8                 6             ~2.2%
10                7             ~0.82%
16                11            ~0.046%
24                17            ~0.001%
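The approximation can be evaluated directly; a quick sketch (function name illustrative):

```python
import math

def fpp(bits_per_elem, k):
    """Approximate false positive probability (1 - e^(-kn/m))^k, given m/n bits per element."""
    return (1 - math.exp(-k / bits_per_elem)) ** k

for bits, k in [(4, 3), (8, 6), (10, 7), (16, 11), (24, 17)]:
    print(f"m/n = {bits:2d}, k = {k:2d}: FPP ≈ {fpp(bits, k):.4%}")
```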
06

Optimal Number of Hash Functions

The trade-off

  • Too few hash functions — high FPP because fewer bits checked
  • Too many hash functions — bit array fills up faster, increasing FPP
  • Optimal k balances these two effects

Formulas

k_opt = (m/n) × ln(2) ≈ 0.693 × (m/n)

FPP_min = (1/2)^k ≈ (0.6185)^(m/n)

Practical sizing

Given desired false positive rate p and expected elements n:

m = -n × ln(p) / (ln(2))²
k = (m/n) × ln(2)

Example

1 million elements, 1% false positive rate:

  • m = 9,585,059 bits (~1.14 MB)
  • k = 7 hash functions
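The sizing formulas translate into a few lines of Python (helper name is illustrative):

```python
import math

def bloom_size(n, p):
    """Bits m and hash count k for n expected elements at target FPP p."""
    m = math.ceil(-n * math.log(p) / math.log(2) ** 2)
    k = round(m / n * math.log(2))
    return m, k

m, k = bloom_size(1_000_000, 0.01)
print(m, k)  # 9585059 7  → ~1.14 MiB of bits and 7 hash functions
```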

Kirsch-Mitzenmacher optimisation

Use two hash functions and derive the k positions via h_i(x) = (h1(x) + i × h2(x)) mod m. The false positive rate asymptotically matches k independent functions.

07

Counting Bloom Filters

Problem

Standard Bloom filters do not support deletion — clearing a bit might remove other elements.

Solution: replace bits with counters

Use a small counter per position (typically 4 bits):

  • Insert(x) — increment each counter at h_i(x)
  • Delete(x) — decrement each counter at h_i(x)
  • Query(x) — all counters > 0? → probably in set
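A minimal sketch of the counter-based variant (class name illustrative; a plain Python list stands in for the packed 4-bit counters a real implementation would use):

```python
import hashlib

class CountingBloom:
    """Counting Bloom filter sketch: per-position counters enable deletion."""
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.counters = [0] * m  # real implementations pack 4-bit counters

    def _positions(self, item):
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.counters[pos] += 1

    def remove(self, item):
        # Caller must ensure the item was actually added,
        # or counters go negative and queries break.
        for pos in self._positions(item):
            self.counters[pos] -= 1

    def __contains__(self, item):
        return all(self.counters[pos] > 0 for pos in self._positions(item))

cb = CountingBloom(m=1024, k=7)
cb.add("cat")
cb.add("dog")
cb.remove("dog")
print("cat" in cb, "dog" in cb)  # True False (barring a rare collision)
```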

Trade-offs

Aspect      Standard        Counting
Space       1 bit/slot      4 bits/slot (4×)
Deletion    Not supported   Supported
Overflow    N/A             Rare (~10⁻¹⁵ per counter)
False neg   Impossible      Possible on overflow

Used in network routing tables and distributed caches where elements must be removed.

08

Cuckoo Filters

A cuckoo filter stores compact fingerprints in a cuckoo hash table. It is named after cuckoo hashing, where inserted elements can displace existing ones to their alternative locations.

Structure

  • Array of buckets, each holding a small number of slots (typically 4)
  • Each element → a fingerprint (8–12 bits)
  • Two candidate buckets: b1 = hash(x), b2 = b1 XOR hash(fp(x))

Operations

  • Insert — place fingerprint in b1 or b2; if both full, evict (cuckoo displacement)
  • Lookup — check fingerprint in b1 or b2
  • Delete — remove matching fingerprint from b1 or b2
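The three operations can be sketched as follows. This is a simplified illustration (class name, 8-bit fingerprints, sha256-derived hashing, and the power-of-two bucket count are all assumptions of this sketch, chosen so the XOR trick is an involution):

```python
import hashlib, random

def _h(data: bytes) -> int:
    """64-bit hash (sha256-derived; any good hash works)."""
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

class CuckooFilter:
    """Cuckoo filter sketch: 4-slot buckets, 8-bit fingerprints, partial-key hashing."""
    def __init__(self, num_buckets=1024, max_kicks=500):
        self.n = num_buckets          # power of two, so the XOR trick stays valid
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]

    def _fingerprint(self, item):
        fp = _h(item.encode()) & 0xFF
        return fp or 1                # reserve 0 so fingerprints are never empty

    def _candidates(self, item):
        fp = self._fingerprint(item)
        b1 = _h(item.encode()) % self.n
        b2 = (b1 ^ _h(bytes([fp]))) % self.n  # and b1 == (b2 ^ hash(fp)) % n
        return fp, b1, b2

    def insert(self, item):
        fp, b1, b2 = self._candidates(item)
        for b in (b1, b2):
            if len(self.buckets[b]) < 4:
                self.buckets[b].append(fp)
                return True
        b = random.choice((b1, b2))   # both full: start cuckoo displacement
        for _ in range(self.max_kicks):
            i = random.randrange(len(self.buckets[b]))
            fp, self.buckets[b][i] = self.buckets[b][i], fp  # evict a resident
            b = (b ^ _h(bytes([fp]))) % self.n               # its other bucket
            if len(self.buckets[b]) < 4:
                self.buckets[b].append(fp)
                return True
        return False                  # filter is effectively full

    def __contains__(self, item):
        fp, b1, b2 = self._candidates(item)
        return fp in self.buckets[b1] or fp in self.buckets[b2]

    def delete(self, item):
        fp, b1, b2 = self._candidates(item)
        for b in (b1, b2):
            if fp in self.buckets[b]:
                self.buckets[b].remove(fp)
                return True
        return False

cf = CuckooFilter()
cf.insert("cat")
cf.insert("dog")
cf.delete("dog")
print("cat" in cf, "dog" in cf)  # True False (barring a rare fingerprint collision)
```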

vs Bloom filters

Feature            Bloom        Cuckoo
Deletion           No           Yes
Space (FPP < 3%)   Good         Better
Lookup             k accesses   2 accesses
Max load           ~50%         ~95%

Cuckoo filters are the modern default for membership queries when deletion is needed or FPP target is below ~3%.

09

Quotient Filters

Cache-friendly alternative

Hash each element to a p-bit fingerprint. Split into q-bit quotient (bucket index) and r-bit remainder (stored value). p = q + r.

Key properties

  • Cache-friendly — contiguous array, no pointer chasing
  • Mergeable — two filters merged in linear time
  • Resizable — can be doubled by redistributing remainders
  • Deletion — supported with counting variants

Metadata bits per slot

Bit               Purpose
is_occupied       A canonical slot for some element's quotient
is_continuation   Part of a run (same quotient)
is_shifted        Stored away from its canonical slot

Quotient filters shine in on-disk scenarios (LSM trees, SSTables) because they are cache-line-aligned and support efficient merging during compaction.

10

HyperLogLog — Cardinality Estimation

Problem

Count distinct elements in a massive stream. Exact counting requires O(n) space. HyperLogLog: ~12 KB for billions of elements.

Core insight

Hash elements uniformly. The maximum number of leading zeros in any hash tells you roughly log2(n_distinct).

  • Longest run of leading zeros = k → estimate ~2^k distinct elements
  • Single observation is noisy — HyperLogLog uses many independent observations

Real-world usage

System         Use
Redis          PFADD / PFCOUNT
BigQuery       APPROX_COUNT_DISTINCT()
Presto/Trino   approx_distinct()
Spark          approx_count_distinct()
PostgreSQL     hll extension

HyperLogLog is arguably the most widely deployed probabilistic data structure in production systems today.

11

HyperLogLog — How It Works

Algorithm

  1. Hash each element to a uniform binary string
  2. Use the first b bits to select one of m = 2^b registers
  3. Count leading zeros in the remaining bits; store the maximum per register
  4. Estimate: E = α_m · m² · (Σ_j 2^(−R[j]))^(−1)
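The four steps map directly to code. A bare-bones sketch (class name and sha256-derived 64-bit hash are assumptions; the small- and large-range corrections described below are deliberately omitted):

```python
import hashlib

def _hash64(item: str) -> int:
    return int.from_bytes(hashlib.sha256(item.encode()).digest()[:8], "big")

class HyperLogLog:
    """Raw HLL estimator with m = 2^b registers; range corrections omitted."""
    def __init__(self, b=10):
        self.b, self.m = b, 1 << b
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)  # bias constant for m >= 128

    def add(self, item):
        x = _hash64(item)
        j = x >> (64 - self.b)                 # first b bits pick a register
        rest = x & ((1 << (64 - self.b)) - 1)  # remaining 64 - b bits
        rho = (64 - self.b) - rest.bit_length() + 1  # leading zeros + 1
        self.registers[j] = max(self.registers[j], rho)

    def estimate(self):
        z = sum(2.0 ** -r for r in self.registers)
        return self.alpha * self.m * self.m / z

hll = HyperLogLog(b=10)   # 1024 registers → ~3.25% standard error
for i in range(100_000):
    hll.add(f"user-{i}")
print(round(hll.estimate()))  # close to 100000
```

Merging two such sketches is just an element-wise max over their register arrays.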

Corrections

  • Small range — many registers empty → use linear counting
  • Large range — estimate exceeds 2^32/30 → hash collision correction
  • HLL++ (Google, 2013) — improved bias correction, sparse representation

Space and accuracy

Registers (m)   Memory   Std Error
64              192 B    ~13%
256             768 B    ~6.5%
1024            3 KB     ~3.25%
4096            12 KB    ~1.6%
16384           48 KB    ~0.8%

Mergeability

Registers can be merged by taking element-wise maximums — perfect for distributed counting across shards.

12

Count-Min Sketch

Structure

2D array of counters: d rows (hash functions) × w columns (width). d independent hash functions, one per row.

Operations

  • Update(x, c) — for each row i, increment counters[i][h_i(x)] by c
  • Query(x) — return min(counters[i][h_i(x)]) across all rows

Error bounds

  • Estimate is always ≥ true count (over-estimates only)
  • With probability 1 − δ: error ≤ ε · ‖stream‖₁ (the total count across all elements)
  • Space: w = ⌈e/ε⌉, d = ⌈ln(1/δ)⌉
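A compact sketch of the structure, sized from ε and δ as above. The class name is illustrative, and deriving the per-row hashes from disjoint slices of one sha256 digest is a simplification of the d independent hash functions; the columns will therefore differ from the illustrative walk-through below, but the counts behave the same way:

```python
import hashlib, math

class CountMinSketch:
    """Count-Min Sketch sized from error ε and failure probability δ."""
    def __init__(self, eps=0.01, delta=0.01):
        self.w = math.ceil(math.e / eps)         # width: ⌈e/ε⌉ columns
        self.d = math.ceil(math.log(1 / delta))  # depth: ⌈ln(1/δ)⌉ rows
        self.table = [[0] * self.w for _ in range(self.d)]

    def _cols(self, item):
        # One column per row, derived from disjoint 4-byte slices of one digest.
        digest = hashlib.sha256(item.encode()).digest()
        return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.w
                for i in range(self.d)]

    def update(self, item, count=1):
        for row, col in enumerate(self._cols(item)):
            self.table[row][col] += count

    def query(self, item):
        # Minimum across rows: collisions only inflate counts, never deflate.
        return min(self.table[row][col]
                   for row, col in enumerate(self._cols(item)))

cms = CountMinSketch(eps=0.01, delta=0.01)
cms.update("cat", 5)
cms.update("dog", 3)
print(cms.query("cat"), cms.query("dog"))  # 5 3 (estimates never undercount)
```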

Walk-through

d=3 rows, w=8 columns

Update("cat", 5):
  row 0 col 3 += 5
  row 1 col 7 += 5
  row 2 col 1 += 5

Update("dog", 3):
  row 0 col 5 += 3
  row 1 col 3 += 3
  row 2 col 1 += 3

Query("cat"):
  min(col3=5, col7=5, col1=8) = 5 ✓

Query("dog"):
  min(col5=3, col3=3, col1=8) = 3 ✓

Used for heavy-hitter detection, traffic monitoring, and frequency capping in ad tech.

13

MinHash — Jaccard Similarity

Problem

Estimate similarity of two sets without full comparison. Jaccard similarity: J(A,B) = |A∩B| / |A∪B|

Algorithm

  1. Choose k independent hash functions
  2. For each set, compute the minimum hash value under each function
  3. Fraction of matching minimums estimates Jaccard similarity
Pr[min(h(A)) = min(h(B))] = J(A, B)

The probability that minimum hash values agree exactly equals the Jaccard similarity. Averaging over k hash functions reduces variance.
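The estimator is short enough to show end to end. Salting one base hash per "hash function" is a common implementation shortcut assumed here, and the function names are illustrative:

```python
import hashlib

def minhash_signature(items, k=128):
    """k salted hash functions; keep the minimum hash value of each over the set."""
    sig = []
    for i in range(k):
        salt = i.to_bytes(4, "big")
        sig.append(min(
            int.from_bytes(hashlib.sha256(salt + w.encode()).digest()[:8], "big")
            for w in items))
    return sig

def estimate_jaccard(sig_a, sig_b):
    # Fraction of positions where the two signatures agree.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = {f"item-{i}" for i in range(100)}
b = {f"item-{i}" for i in range(50, 150)}  # true Jaccard = 50/150 ≈ 0.33
est = estimate_jaccard(minhash_signature(a), minhash_signature(b))
print(f"estimated Jaccard ≈ {est:.2f}")
```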

Locality-Sensitive Hashing (LSH)

Combine MinHash signatures into bands for fast approximate nearest-neighbour search:

  • Divide signature into b bands of r rows
  • Items are candidates if they agree in all rows of at least one band
  • Tune b and r to control the S-curve threshold

Applications

near-duplicate detection · recommendation · plagiarism detection · code clone search
14

Skip Lists

A probabilistic alternative to balanced trees. Randomised multi-level linked lists achieve O(log n) expected search, insert, and delete — without rotations.

Structure

  • Level 0 — sorted linked list of all elements
  • Level 1+ — each element promoted with probability p (typically 1/2)
  • Search starts at the top level, drops down when overshooting

Expected complexity

Operation   Expected   Worst
Search      O(log n)   O(n)
Insert      O(log n)   O(n)
Delete      O(log n)   O(n)
Space       O(n)       O(n log n)

Why probabilistic?

Balance depends on coin flips during insertion, not deterministic rebalancing. In expectation, levels and search path length are both O(log n).
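The coin-flip promotion is easy to see in isolation. A small demonstration (function name illustrative) of how tower heights distribute:

```python
import random

def random_level(p=0.5, max_level=32):
    """Flip coins: a node reaches level i with probability p^i."""
    level = 0
    while level < max_level and random.random() < p:
        level += 1
    return level

random.seed(1)
levels = [random_level() for _ in range(100_000)]
# Roughly half the nodes stay at level 0, a quarter reach level 1, and so on,
# so the tallest tower among n nodes is about log2(n) levels; the average
# tower height is ~1 extra pointer per node, keeping space O(n).
print(max(levels), sum(levels) / len(levels))
```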

Real-world usage

Redis sorted sets (ZSET)

Skip lists as the primary index structure

LevelDB / RocksDB

Memtable implemented as a skip list

Lucene

Concurrent skip lists for in-memory posting lists

15

Reservoir Sampling

Problem

Select k items uniformly at random from a stream of unknown length n, using only O(k) memory.

Algorithm (Vitter's Algorithm R)

reservoir = first k items from stream

for i = k+1 to n:
    j = random integer in [1, i]
    if j <= k:
        reservoir[j] = stream[i]

Each element ends up in the reservoir with probability k/n.
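The pseudocode above translates directly to Python (0-indexed, working over any iterable stream):

```python
import random

def reservoir_sample(stream, k):
    """Vitter's Algorithm R: uniform k-sample from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)        # fill the reservoir first
        else:
            j = random.randrange(i + 1)   # uniform in [0, i]
            if j < k:
                reservoir[j] = item       # replace with probability k/(i+1)
    return reservoir

random.seed(0)
sample = reservoir_sample(range(1_000_000), k=10)
print(sample)  # 10 uniformly chosen distinct values from the million-item stream
```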

Variants

Variant       Use Case
Weighted      Non-uniform probability proportional to weights
Distributed   Merge reservoirs from parallel streams
Algorithm L   Optimal: skips geometrically many elements

Applications

  • Sampling log lines for analysis without storing the full stream
  • A/B test random assignment from an unbounded user stream
  • Random row selection from a database table scan
16

Comparison Table

Probabilistic data structures at a glance

Structure          Space               Error Type       Delete?   Merge?   Primary Op
Bloom Filter       O(n) bits           False positive   No        OR       Membership
Counting Bloom     O(n) counters       False positive   Yes       Add      Membership
Cuckoo Filter      O(n) fingerprints   False positive   Yes       No       Membership
Quotient Filter    O(n) remainders     False positive   Variant   Yes      Membership
HyperLogLog        ~12 KB              ±~1.6%           N/A       Max      Cardinality
Count-Min Sketch   O(1/ε)              Over-estimate    N/A       Add      Frequency
MinHash            O(k) per set        ± similarity     N/A       Union    Similarity
Skip List          O(n)                None (exact)     Yes       No       Ordered ops
Reservoir          O(k)                Sampling var.    N/A       Merge    Random sample

Membership filters use bits, not pointers — orders of magnitude smaller than exact sets.

17

Applications — Databases & Caches

Databases

  • LSM-tree SSTables — Bloom filters avoid unnecessary disk reads. LevelDB, RocksDB, Cassandra, HBase.
  • PostgreSQL — Bloom indexes for multi-column equality filters
  • BigQuery / Presto — HLL for APPROX_COUNT_DISTINCT
  • Query planners — Count-Min Sketch for join selectivity estimation

Caches

  • CDN edges — Bloom filter gates origin fetch; avoids cache pollution
  • Chrome Safe Browsing — Bloom filter checks URLs vs malware list locally
  • Squid proxy — cache digests are Bloom filters for cooperative caching

Distributed systems

  • Cassandra — Bloom filter per SSTable for key lookups
  • Bitcoin SPV — Bloom filters for lightweight transaction filtering
  • HDFS — HyperLogLog for file-system analytics across petabytes
18

Applications — Networks & Analytics

Network routers & security

  • Packet dedup — Bloom filters detect duplicates at line rate
  • DDoS mitigation — Count-Min Sketch identifies heavy-hitter IPs
  • Firewall ACLs — Bloom filter for fast IP/port membership

Spell checkers & NLP

  • Dictionaries — Bloom filter stores word list without full dictionary in memory
  • Web crawlers — visited-URL tracking via Bloom filter (Google, Bing)
  • Plagiarism — MinHash over word n-grams for near-duplicate detection

Analytics & streaming

  • Unique visitors — HyperLogLog for real-time uniques (Redis PFCOUNT)
  • Ad frequency capping — Count-Min Sketch limits ad impressions per user
  • Dashboards — reservoir sampling for percentile estimation
  • Stream joins — Bloom filter pre-filters one side, reducing waste
19

Design Trade-offs

Choosing the right structure

Scenario                    Recommended
Membership, no deletion     Bloom filter
Membership with deletion    Cuckoo filter
On-disk membership          Quotient filter
Distinct count              HyperLogLog
Frequency estimation        Count-Min Sketch
Set similarity at scale     MinHash + LSH
Ordered, concurrent         Skip list
Random sample from stream   Reservoir sampling

Common pitfalls

Under-sizing Bloom filters

Leads to unacceptable FPP as the filter fills

Ignoring hash quality

Poor hash functions create correlated bits, inflating FPP

Forgetting about growth

Standard Bloom filters cannot be resized; use scalable variants

Treating estimates as exact

Propagate uncertainty through your system

20

Summary & Further Reading

Key takeaways

  • Probabilistic structures trade bounded, tuneable error for dramatic space and time savings
  • Bloom filters remain the workhorse for membership queries — simple, fast, well-analysed
  • HyperLogLog is the standard for cardinality estimation in databases and streaming
  • Count-Min Sketch is the go-to for frequency estimation in streaming and network monitoring
  • Cuckoo filters improve on Bloom filters when deletion is needed or FPP < 3%
  • MinHash enables similarity estimation at scale for deduplication and recommendation
  • Skip lists prove randomisation can replace complex balancing logic
  • Always size structures based on expected volume and acceptable error rate

Recommended reading

Source                    Description
Bloom (1970)              "Space/Time Trade-offs in Hash Coding with Allowable Errors"
Flajolet et al. (2007)    "HyperLogLog: The Analysis of a Near-Optimal Cardinality Estimation Algorithm"
Fan et al. (2014)         "Cuckoo Filter: Practically Better Than Bloom"
Cormode & Muthukrishnan   "An Improved Data Stream Summary: The Count-Min Sketch and Its Applications"
Mitzenmacher & Upfal      Probability and Computing
Redis Docs                redis.io/docs — HLL, Bloom, skip lists