COMPUTER SCIENCE FUNDAMENTALS SERIES

String Matching
Algorithms

KMP · Rabin-Karp · Boyer-Moore · Suffix arrays ·
Aho-Corasick · Pattern matching
Mid-level software engineer track  ·  20 slides
02

The String Matching Problem

Definition

Given a text T of length n and a pattern P of length m, find all occurrences of P in T.

  • Input: text T[0..n-1], pattern P[0..m-1]
  • Output: all positions i where T[i..i+m-1] == P

Why it matters

Every time you press Ctrl+F, run grep, query a database with LIKE, or align a DNA sequence, a string matching algorithm is at work.

Terminology

Term          Meaning
Text          The string being searched (haystack)
Pattern       The string being searched for (needle)
Alphabet      The set of valid characters, size |Σ|
Shift         An alignment position of pattern against text
Valid shift   A shift where every character matches

The naive approach checks every shift. Efficient algorithms skip provably invalid shifts.

03

Naive String Matching

The brute-force approach

For each position i in the text, compare P[0..m-1] against T[i..i+m-1] character by character.

NAIVE-STRING-MATCH(T, P):
  n = len(T), m = len(P)
  for i = 0 to n - m:
    if T[i..i+m-1] == P[0..m-1]:
      report match at i
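A minimal runnable sketch of the pseudocode above in Python (the slice comparison stands in for the inner character-by-character loop):

```python
def naive_search(text: str, pattern: str) -> list[int]:
    """Check every shift; return all positions where pattern occurs in text."""
    n, m = len(text), len(pattern)
    # one slice comparison per shift hides the inner character loop
    return [i for i in range(n - m + 1) if text[i:i + m] == pattern]
```

For example, `naive_search("ababcab", "ab")` returns `[0, 2, 5]` — note that overlapping occurrences are reported too.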

When it is good enough

  • Short patterns (m < 10) on moderate text
  • Large alphabets (English text) where mismatches occur early
  • One-off searches where preprocessing overhead is not justified

Complexity

Worst case

O(n · m) — e.g. T = "aaaaaa", P = "aaab"

Best case

O(n) — the first pattern character mismatches at nearly every shift

Average case

O(n) — for random text over a large alphabet

Space

O(1) — no preprocessing required

The naive algorithm discards all information from a partial match. Every advanced algorithm exploits that wasted knowledge.

04

KMP — Intuition & Failure Function

Core insight

When a mismatch occurs at P[j] after matching P[0..j-1], the naive algorithm restarts from scratch. KMP asks: what is the longest proper prefix of P[0..j-1] that is also a suffix? Resume matching from there.

The failure function (prefix table)

π[j] = length of the longest proper prefix of P[0..j] that is also a suffix.

Pattern: A B A B A C
Index:   0 1 2 3 4 5
π:       0 0 1 2 3 0

Building the prefix table — O(m)

COMPUTE-PREFIX(P):
  m = len(P)
  π[0] = 0
  k = 0
  for q = 1 to m - 1:
    while k > 0 and P[k] != P[q]:
      k = π[k - 1]
    if P[k] == P[q]:
      k = k + 1
    π[q] = k
  return π
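The same routine as runnable Python, handy for checking small examples by hand:

```python
def compute_prefix(pattern: str) -> list[int]:
    """pi[q] = length of the longest proper prefix of P[0..q] that is also a suffix."""
    m = len(pattern)
    pi = [0] * m
    k = 0                          # length of the current matched prefix
    for q in range(1, m):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]          # fall back to the next-shorter border
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    return pi
```

`compute_prefix("ABABAC")` reproduces the table above: `[0, 0, 1, 2, 3, 0]`.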

The prefix table encodes the internal structure of the pattern so that no text character is ever re-examined.

05

KMP — Algorithm Walkthrough

The algorithm

KMP-SEARCH(T, P):
  n = len(T), m = len(P)
  π = COMPUTE-PREFIX(P)
  q = 0
  for i = 0 to n - 1:
    while q > 0 and P[q] != T[i]:
      q = π[q - 1]       // fall back
    if P[q] == T[i]:
      q = q + 1
    if q == m:
      report match at i - m + 1
      q = π[q - 1]       // next match
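Both phases as one self-contained Python function (the prefix-table loop is repeated inline so the sketch stands alone):

```python
def kmp_search(text: str, pattern: str) -> list[int]:
    """Find all occurrences of pattern in text in O(n + m)."""
    m = len(pattern)
    # phase 1: build the prefix table
    pi = [0] * m
    k = 0
    for q in range(1, m):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    # phase 2: scan the text; q = number of pattern chars currently matched
    matches, q = [], 0
    for i, c in enumerate(text):
        while q > 0 and pattern[q] != c:
            q = pi[q - 1]          # fall back, never moving i backwards
        if pattern[q] == c:
            q += 1
        if q == m:
            matches.append(i - m + 1)
            q = pi[q - 1]          # keep going for overlapping matches
    return matches
```

On the trace example below, `kmp_search("ABABABACAB", "ABABAC")` returns `[2]`.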

Trace example

Text:    A B A B A B A C A B
Pattern: A B A B A C
π:       0 0 1 2 3 0

i=5: q=5 (matched ABABA), mismatch P[5]='C' vs T[5]='B'
     q = π[4] = 3, resume comparing P[3] with T[5]
     → T[0..2] never re-examined

Complexity

Metric         Value
Preprocessing  O(m) — build prefix table
Matching       O(n) — each character examined at most twice
Total          O(n + m)
Space          O(m) — prefix table

KMP performs at most 2n character comparisons in the worst case and never moves the text pointer backward — which makes it ideal for streaming input.

06

Rabin-Karp — Rolling Hash

Idea

Instead of comparing characters, compute a hash of each m-length window and compare it to the hash of the pattern. A rolling hash updates in O(1) per shift.

Rolling hash (polynomial)

hash(S[i..i+m-1]) =
  ( Σ_{j=0..m-1} S[i+j] · d^(m-1-j) ) mod q

d = |Σ| (alphabet size)
q = large prime

Sliding the window

hash(S[i+1..i+m]) =
  ( (hash(S[i..i+m-1]) - S[i] · d^(m-1)) · d
    + S[i+m] ) mod q

Algorithm sketch

RABIN-KARP(T, P):
  hp = hash(P[0..m-1])
  ht = hash(T[0..m-1])
  for i = 0 to n - m:
    if ht == hp:
      if T[i..i+m-1] == P:  // verify
        report match at i
    if i < n - m:
      ht = roll(ht, T[i], T[i+m])
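A concrete version of the sketch, assuming base d = 256 (byte alphabet) and a large prime modulus — both tunable:

```python
def rabin_karp(text: str, pattern: str,
               d: int = 256, q: int = 1_000_000_007) -> list[int]:
    """Rolling-hash search; every hash hit is verified to rule out spurious matches."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    h = pow(d, m - 1, q)               # d^(m-1) mod q, weight of the leading char
    hp = ht = 0
    for j in range(m):                 # initial hashes of pattern and first window
        hp = (hp * d + ord(pattern[j])) % q
        ht = (ht * d + ord(text[j])) % q
    matches = []
    for i in range(n - m + 1):
        if ht == hp and text[i:i + m] == pattern:   # verify: hashes can collide
            matches.append(i)
        if i < n - m:                  # roll the window one position right
            ht = ((ht - ord(text[i]) * h) * d + ord(text[i + m])) % q
    return matches
```

With the prime modulus, collisions are rare; the verification step makes the output exact regardless.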

Key strength

Natural extension to multi-pattern search — hash each pattern, store in a set, check every window against the set.

07

Rabin-Karp — Spurious Hits & Multi-Pattern

Spurious hits

Two different strings can have the same hash (collision). Each collision triggers a full O(m) character comparison. With a good prime q, the expected number of spurious hits is O(n/q).

Complexity

Case            Time
Best / average  O(n + m)
Worst case      O(n · m) — many collisions
Space           O(1) beyond input

Multi-pattern variant

To search for k patterns simultaneously:

  1. Hash all k patterns and store in a hash set
  2. Slide window over text; check each window hash against the set
  3. Expected time: O(n + km)

Practical tips

  • Use double hashing (two independent hash functions) to reduce false positives to near zero
  • Rabin-Karp is the basis of many plagiarism detectors (compare document fingerprints)
  • Works well when pattern length is fixed and known

Rabin-Karp trades deterministic guarantees for practical speed and multi-pattern flexibility.

08

Boyer-Moore — Bad Character Rule

Why Boyer-Moore?

Often the fastest single-pattern algorithm in practice. Scans the pattern right to left and uses two heuristics to skip large portions of text.

Bad character rule

When a mismatch occurs at text character c and pattern position j:

  • If c does not appear in the pattern → shift the pattern past the mismatch entirely
  • If the rightmost occurrence of c in P[0..j-1] is at position k → shift so that occurrence lines up with c in the text
  • If c occurs only to the right of j, the rule would shift the pattern backwards — fall back to a shift of 1

Example

Text:    ...X E R C I S E...
Pattern:    E X A M P L E
            ←───── comparing right to left
P[6]='E' matches T[6]='E'
Mismatch: P[5]='L' vs T[5]='S'
'S' not in pattern
  → shift pattern past T[5]: 6 positions right

Bad character table

BUILD-BAD-CHAR(P):
  for each c in Σ: bc[c] = -1
  for j = 0 to m - 1:
    bc[P[j]] = j
  return bc
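The bad-character idea alone already gives a practical matcher. The sketch below is the Horspool simplification of Boyer-Moore, which keys every shift on the text character aligned with the pattern's last position:

```python
def horspool_search(text: str, pattern: str) -> list[int]:
    """Boyer-Moore-Horspool: bad-character shifts only, no good-suffix rule."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # shift[c] = distance from c's rightmost occurrence (excluding the last
    # pattern position) to the end of the pattern
    shift = {c: m - 1 - j for j, c in enumerate(pattern[:-1])}
    matches, i = [], 0
    while i <= n - m:
        if text[i:i + m] == pattern:        # conceptually compared right to left
            matches.append(i)
        # shift on the text char under the pattern's last position;
        # chars absent from the pattern allow a full-length jump
        i += shift.get(text[i + m - 1], m)
    return matches
```

`horspool_search("here is a simple example", "example")` finds the match at position 17 after only a handful of window inspections.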

The bad character rule alone gives sub-linear average performance on large alphabets.

09

Boyer-Moore — Good Suffix Rule

Good suffix rule

When characters P[j+1..m-1] have matched but P[j] mismatches:

  • Find the rightmost occurrence of the matched suffix elsewhere in the pattern
  • If no full occurrence exists, find the longest prefix of the pattern that matches a suffix of the matched portion
  • Shift accordingly

Combined shift

shift = max(bad_char_shift,
            good_suffix_shift)

Boyer-Moore always takes the maximum of both heuristics — the larger the skip, the better.

Complexity

Metric         Value
Preprocessing  O(m + |Σ|)
Best case      O(n/m) — sub-linear!
Worst case     O(n · m); O(n) with Galil's rule
Average        O(n/m) for large alphabets

In practice

  • Used by GNU grep for fixed-string (literal) searches
  • Faster than KMP for long patterns on natural-language text
  • The O(n/m) best case means longer patterns can actually be faster to find

Boyer-Moore is the rare algorithm where increasing the pattern length improves performance.

10

Z-Algorithm

Z-array definition

For a string S, Z[i] = length of the longest substring starting at S[i] that matches a prefix of S.

S:  a a b x a a b
Z:  - 1 0 0 3 1 0

Algorithm — O(n)

Maintain a Z-box [l, r] representing the rightmost interval matching a prefix. For each position i:

  • If i > r, compute Z[i] naively and update [l, r]
  • If i ≤ r, use previously computed values to initialise Z[i], then extend if needed

String matching with Z-algorithm

Concatenate P$T (where $ is not in the alphabet) and compute the Z-array. Any position i where Z[i] == m is a match.

P = "aab", T = "aabxaab"
S = "aab$aabxaab"
Z = [- 1 0 0 3 1 0 0 3 1 0]
  matches at i=4 and i=8
  → text positions 0 and 4
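Both steps — Z-array construction and the concatenation trick — fit in a short Python sketch (a NUL byte stands in for the separator $, assuming it occurs in neither string):

```python
def z_array(s: str) -> list[int]:
    """Z[i] = length of the longest substring at i matching a prefix of s (Z[0] unused)."""
    n = len(s)
    z = [0] * n
    l = r = 0                          # current Z-box: s[l:r] matches a prefix
    for i in range(1, n):
        if i < r:                      # inside the box: reuse an earlier value
            z[i] = min(r - i, z[i - l])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1                  # extend past the box by direct comparison
        if i + z[i] > r:               # new rightmost box
            l, r = i, i + z[i]
    return z

def z_search(text: str, pattern: str, sep: str = "\x00") -> list[int]:
    """Match by computing Z over pattern + sep + text."""
    m = len(pattern)
    z = z_array(pattern + sep + text)
    return [i - m - 1 for i in range(m + 1, len(z)) if z[i] == m]
```

`z_search("aabxaab", "aab")` returns `[0, 4]`, matching the worked example above.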

Complexity

Metric  Value
Time    O(n + m)
Space   O(n + m)

The Z-algorithm is conceptually simpler than KMP and equally powerful. It is also the foundation for many suffix-based constructions.

11

Aho-Corasick — Multi-Pattern Matching

Problem

Given k patterns and text of length n, find all occurrences of all patterns in one pass.

Construction

  1. Build a trie from all k patterns
  2. Add failure links — analogous to KMP's failure function; on mismatch, follow the failure link to the longest proper suffix that is also a prefix of some pattern
  3. Add output links — chain patterns that end at the same trie node via suffixes

Example

Patterns: {he, she, his, hers}
Text:     "ushers"

Matches: she (at 1), he (at 2),
         hers (at 2)

Search — O(n + z)

Feed the text through the automaton character by character. At each state, follow output links to report all matching patterns. z = total matches reported.
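A compact dict-based sketch of the whole construction — trie, BFS failure links, inherited outputs, then the single scan. A production automaton would use dense transition tables instead:

```python
from collections import deque

def aho_corasick(patterns: list[str], text: str) -> list[tuple[int, str]]:
    """Report (start, pattern) for every occurrence of every pattern in one pass."""
    # 1. Build the trie: goto[s] maps a character to the next state
    goto, fail, out = [{}], [0], [[]]
    for p in patterns:
        s = 0
        for c in p:
            if c not in goto[s]:
                goto.append({}); fail.append(0); out.append([])
                goto[s][c] = len(goto) - 1
            s = goto[s][c]
        out[s].append(p)
    # 2. Failure links via BFS — KMP's prefix function generalised to a trie
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for c, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and c not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(c, 0)
            out[t] += out[fail[t]]          # inherit output links
    # 3. Scan the text, reporting every pattern that ends at each position
    hits, s = [], 0
    for i, c in enumerate(text):
        while s and c not in goto[s]:
            s = fail[s]
        s = goto[s].get(c, 0)
        for p in out[s]:
            hits.append((i - len(p) + 1, p))
    return hits
```

On the slide's example, `aho_corasick(["he", "she", "his", "hers"], "ushers")` reports `(1, "she")`, `(2, "he")`, and `(2, "hers")`.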

Complexity

Metric  Value
Build   O(m · |Σ|) where m = total pattern length
Search  O(n + z) — linear in text + output
Space   O(m · |Σ|) for the automaton

Used in

Snort IDS · fgrep · virus scanners · ripgrep

Aho-Corasick is the multi-pattern generalisation of KMP.

12

Suffix Arrays

Definition

A suffix array SA for string S of length n is a sorted array of all suffix indices ordered by the lexicographic order of the suffixes they represent.

S = "banana$"
Suffixes sorted:     SA:
$                    6
a$                   5
ana$                 3
anana$               1
banana$              0
na$                  4
nana$                2

Pattern search

Binary search the suffix array. Each comparison is O(m), so search is O(m log n). With an LCP array, this improves to O(m + log n).
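A sketch of both halves — naive O(n² log n) construction plus the two binary searches (fast constructions like SA-IS are considerably more involved):

```python
def suffix_array(s: str) -> list[int]:
    """Naive construction: sort suffix indices by the suffixes themselves."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def sa_search(text: str, sa: list[int], pattern: str) -> list[int]:
    """All occurrences of pattern via two binary searches: O(m log n)."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                               # leftmost suffix >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:                               # leftmost suffix > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])                  # contiguous SA block = all matches
```

`suffix_array("banana$")` reproduces the table above (`[6, 5, 3, 1, 0, 4, 2]`), and `sa_search("banana$", sa, "ana")` returns `[1, 3]`.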

Construction algorithms

Algorithm        Time          Space
Naive sort       O(n² log n)   O(n)
Prefix doubling  O(n log n)    O(n)
SA-IS / DC3      O(n)          O(n)

LCP array

Stores the length of the longest common prefix between consecutive suffixes in sorted order. Enables:

  • Longest repeated substring — max(LCP)
  • Number of distinct substrings — n(n+1)/2 - ΣLCP
  • Faster pattern search — O(m + log n)

Suffix arrays use 4-8x less memory than suffix trees while supporting most of the same queries. Preferred in modern bioinformatics.

13

Suffix Trees

Definition

A suffix tree for string S is a compressed trie of all suffixes of S$. Every internal node (except root) has at least two children. Edge labels are substrings of S.

Key properties

  • Contains n leaves (one per suffix)
  • At most n - 1 internal nodes
  • Built in O(n) time (Ukkonen's algorithm)
  • Every substring of S corresponds to a path from the root

Operations

Operation                   Time
Pattern search              O(m)
Longest repeated substring  O(n)
Longest common substring    O(n + m)
Distinct substrings         O(n)
Shortest unique substring   O(n)

Trade-offs

Advantage

O(m) pattern matching (no log factor), rich query support

Disadvantage

10-20x memory overhead per character; complex to implement. Suffix arrays + LCP have largely replaced suffix trees in practice.

14

Regular Expression Matching

NFA simulation (Thompson)

  • Parse regex into NFA — O(m) states
  • Maintain set of active states
  • For each input character, compute next state set
  • Time: O(n · m)
  • Space: O(m)
  • No pathological backtracking — always linear in n · m

DFA approach

  • Convert NFA to DFA (subset construction)
  • Each character requires exactly one state transition
  • Matching: O(n) — one transition per character
  • Construction: up to O(2^m) states (exponential blowup)
  • Lazy DFA builds states on demand, caching frequent ones

Backtracking engines

  • PCRE, Python re, Java regex
  • O(2^n) worst case
  • Patterns like (a+)+b cause catastrophic backtracking
  • Support backreferences (not regular)
  • Safe alternatives: RE2, Rust regex, Go regexp

Use Thompson NFA or lazy DFA engines for guaranteed linear-time matching. Avoid backtracking engines on untrusted input.

15

Approximate String Matching — Edit Distance

Edit distance (Levenshtein)

Minimum number of single-character operations (insert, delete, substitute) to transform string A into B.

edit_distance("kitten", "sitting") = 3
  kitten → sitten  (sub k→s)
  sitten → sittin  (sub e→i)
  sittin → sitting (insert g)

DP recurrence — O(nm)

DP[i][j] = min(
  DP[i-1][j] + 1,      // delete
  DP[i][j-1] + 1,      // insert
  DP[i-1][j-1] + cost  // substitute
)
// cost = 0 if A[i] == B[j]
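The recurrence in runnable form, keeping only the previous DP row so space drops from O(nm) to O(m):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with a rolling 1-D DP row."""
    prev = list(range(len(b) + 1))       # row 0: distance from "" to b[:j]
    for i, ca in enumerate(a, 1):
        curr = [i]                       # column 0: distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]
```

`edit_distance("kitten", "sitting")` returns 3, matching the worked example above.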

Variants

Variant              Operations
Levenshtein          Insert, delete, substitute
Damerau-Levenshtein  + transposition of adjacent chars
Hamming              Substitution only (equal-length strings)
LCS distance         Insert and delete only

Applications

  • Spell checkers and autocorrect
  • Fuzzy search (find strings within edit distance k)
  • DNA/protein sequence alignment
  • OCR post-correction

Edit distance DP is the foundation of bioinformatics alignment algorithms like Smith-Waterman and Needleman-Wunsch.

16

Longest Common Substring

Problem

Given two strings A (length n) and B (length m), find the longest string that is a contiguous substring of both.

DP approach — O(nm)

DP[i][j] = DP[i-1][j-1] + 1  if A[i]==B[j]
         = 0                   otherwise

Answer = max(DP[i][j]) for all i, j
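The DP in runnable form, tracking where the best cell occurs so the substring itself can be recovered:

```python
def longest_common_substring(a: str, b: str) -> str:
    """O(nm) time, O(m) space; returns one longest common contiguous substring."""
    best_len, best_end = 0, 0            # best_end = end index (exclusive) in a
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1    # extend the diagonal run
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]
```

`longest_common_substring("abcdxyz", "xyzabcd")` returns `"abcd"`.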

Suffix array approach

  1. Concatenate A$B# (unique separators)
  2. Build suffix array and LCP array
  3. Answer = maximum LCP between adjacent suffixes from different strings

Comparison of methods

Method        Time                Space
DP            O(nm)               O(nm) or O(m)
Suffix array  O((n+m) log(n+m))   O(n+m)
Suffix tree   O(n+m)              O(n+m)

Related problems

Longest common subsequence (LCS)

Characters need not be contiguous. O(nm) DP. Used by diff tools.

Shortest common superstring

NP-hard in general. Greedy approximation used in genome assembly.

Longest common substring is a building block for diff tools, plagiarism detection, and DNA sequence comparison.

17

Application — Text Editors & Search

Find & replace

Every text editor uses string matching. The choice of algorithm depends on the use case:

  • Short patterns, interactive: naive or simplified Boyer-Moore
  • Long patterns, large files: full Boyer-Moore — sub-linear scanning
  • Regex search: Thompson NFA or lazy DFA (Vim, VS Code, ripgrep)

Full-text search engines

Elasticsearch, Lucene, and Typesense build inverted indexes mapping terms to document IDs. At query time, they intersect posting lists — multi-pattern matching at scale.

grep and ripgrep

Tool      Algorithm                Key feature
grep -F   Aho-Corasick             Multi-pattern literal search
grep -E   DFA regex                POSIX extended regex
GNU grep  Boyer-Moore + regex      Literal prefix optimisation
ripgrep   Aho-Corasick + lazy DFA  Parallel, .gitignore-aware

ripgrep demonstrates modern string matching: literal optimisations (Aho-Corasick for multiple literals, memchr for single bytes) layered with regex when needed.

18

Application — DNA Sequencing

The alignment problem

DNA sequencing produces millions of short reads (50-300 bp). Each must be aligned to a reference genome (3 billion bp for humans). Speed is critical.

BWT-based aligners

Modern aligners (BWA, Bowtie2) use the Burrows-Wheeler Transform and FM-index — compressed representations of suffix arrays supporting O(m) exact matching with minimal memory.

Approximate alignment

  • Seed-and-extend: find exact short matches, extend with edit-distance DP
  • Smith-Waterman: local alignment with affine gap penalties
  • BLAST: heuristic seed-based search with statistical scoring

Scale

Metric                      Value
Human genome                ~3.2 billion base pairs
Typical sequencing run      100M – 1B reads
Alignment speed (BWA-MEM2)  ~60M reads/hour

Key data structures

FM-index

Compressed suffix array with BWT. Supports exact pattern matching in O(m) time with ~1.5 bytes/character memory.

Wavelet tree

Enables rank and select queries on the BWT in O(log |Σ|) time. Core building block of modern compressed indexes.

Without BWT/FM-index, whole-genome sequencing would be computationally infeasible.

19

Application — Plagiarism Detection

Document fingerprinting

Break documents into overlapping k-grams (k consecutive characters or words), hash each, and compare fingerprint sets between documents.

Winnowing algorithm

Select a subset of k-gram hashes using a sliding window minimum. Guarantees:

  • Any match of length ≥ t is detected
  • Fingerprint density is bounded — not every hash is stored
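A toy sketch of the scheme, using Python's built-in `hash` in place of a rolling hash and leftmost-minimum tie-breaking (the published algorithm specifies the rightmost minimum); k and w here are illustrative choices:

```python
def winnow(text: str, k: int = 5, w: int = 4) -> set[tuple[int, int]]:
    """Return {(position, hash)} fingerprints: the minimum hash of each
    window of w consecutive k-gram hashes.

    Any shared substring of length >= k + w - 1 between two documents
    contributes at least one shared fingerprint hash.
    """
    hashes = [hash(text[i:i + k]) for i in range(len(text) - k + 1)]
    fingerprints = set()
    for i in range(len(hashes) - w + 1):
        window = hashes[i:i + w]
        j = min(range(w), key=lambda x: window[x])   # leftmost minimum
        fingerprints.add((i + j, window[j]))
    return fingerprints
```

Comparing the hash sets of two documents' fingerprints then reveals shared passages without comparing every k-gram.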

MOSS

Stanford's plagiarism detector for code. Uses winnowing with language-aware tokenisation (ignoring variable names, whitespace, comments).

Similarity metrics

Metric       Formula
Jaccard      |A ∩ B| / |A ∪ B|
Containment  |A ∩ B| / |A|
Cosine       A · B / (|A| · |B|) on TF-IDF vectors

Pipeline

  1. Tokenise and normalise
  2. Generate k-gram hashes (Rabin-Karp)
  3. Select fingerprints (winnowing)
  4. Compare fingerprint sets

Rabin-Karp's rolling hash is the engine behind most modern plagiarism detection systems.

20

Complexity Comparison & Summary

Algorithm     Preprocess  Matching    Space       Best for
Naive         —           O(nm)       O(1)        Short patterns, one-off search
KMP           O(m)        O(n)        O(m)        Streaming, worst-case guarantee
Rabin-Karp    O(m)        O(n) avg    O(1)        Multi-pattern, fingerprinting
Boyer-Moore   O(m + |Σ|)  O(n/m) avg  O(m + |Σ|)  Long patterns, large alphabets
Z-Algorithm   O(m)        O(n)        O(n + m)    Simple implementation
Aho-Corasick  O(m · |Σ|)  O(n + z)    O(m · |Σ|)  Many patterns simultaneously
Suffix array  O(n)        O(m log n)  O(n)        Many queries on same text
Suffix tree   O(n)        O(m)        O(n)        Rich substring queries

Key takeaways

  • No single algorithm wins everywhere — choice depends on alphabet size, pattern length, number of patterns, and whether text is static or streaming
  • KMP and Z-algorithm give O(n + m) worst-case guarantees for single-pattern matching
  • Boyer-Moore is fastest in practice for single long patterns on natural text
  • Aho-Corasick is the standard for multi-pattern matching
  • Suffix arrays and trees trade preprocessing for fast repeated queries
  • Approximate matching is fundamentally O(nm) but accelerated with bit-parallelism and filtering
  • Real systems combine algorithms — ripgrep uses Aho-Corasick, memchr, and lazy DFA depending on the query

Recommended reading

CLRS (string matching chapters) · Gusfield (strings, trees, sequences) · Sedgewick & Wayne (practical implementations) · cp-algorithms.com