COMPUTER SCIENCE FUNDAMENTALS SERIES

String Matching
Algorithms

KMP · Rabin-Karp · Boyer-Moore · Suffix arrays ·
Aho-Corasick · Pattern matching
Mid-level software engineer track  ·  20 slides
02

The String Matching Problem

Definition

Given a text T of length n and a pattern P of length m, find all occurrences of P in T.

  • Input: text T[0..n-1], pattern P[0..m-1]
  • Output: all positions i where T[i..i+m-1] == P

Why it matters

Every time you press Ctrl+F, run grep, query a database with LIKE, or align a DNA sequence, a string matching algorithm is at work.

Terminology

Term          Meaning
Text          The string being searched (haystack)
Pattern       The string being searched for (needle)
Alphabet      The set of valid characters, size |Σ|
Shift         An alignment position of pattern against text
Valid shift   A shift where every character matches

The naive approach checks every shift. Efficient algorithms skip provably invalid shifts.

03

Naive String Matching

The brute-force approach

For each position i in the text, compare P[0..m-1] against T[i..i+m-1] character by character.

NAIVE-STRING-MATCH(T, P):
  n = len(T), m = len(P)
  for i = 0 to n - m:
    if T[i..i+m-1] == P[0..m-1]:
      report match at i
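A minimal runnable sketch of the pseudocode above in Python (the slice comparison stands in for the inner character-by-character loop):

```python
def naive_search(text: str, pattern: str) -> list[int]:
    """Check every shift; return all positions where pattern occurs in text."""
    n, m = len(text), len(pattern)
    # one slice comparison per shift hides the inner character loop
    return [i for i in range(n - m + 1) if text[i:i + m] == pattern]
```

For example, `naive_search("ababcab", "ab")` returns `[0, 2, 5]` — note that overlapping occurrences are reported too.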

When it is good enough

  • Short patterns (m < 10) on moderate text
  • Large alphabets (English text) where mismatches occur early
  • One-off searches where preprocessing overhead is not justified

Complexity

Worst case

O(n · m) — e.g. T = "aaaaaa", P = "aaab"

Best case

O(n) — the first pattern character mismatches at nearly every shift

Average case

O(n) — for random text over a large alphabet

Space

O(1) — no preprocessing required

The naive algorithm discards all information from a partial match. Every advanced algorithm exploits that wasted knowledge.

04

KMP — Intuition & Failure Function

Core insight

When a mismatch occurs at P[j] after matching P[0..j-1], the naive algorithm restarts from scratch. KMP asks: what is the longest proper prefix of P[0..j-1] that is also a suffix? Resume matching from there.

The failure function (prefix table)

π[j] = length of the longest proper prefix of P[0..j] that is also a suffix.

Pattern: A B A B A C
Index:   0 1 2 3 4 5
π:       0 0 1 2 3 0

Building the prefix table — O(m)

COMPUTE-PREFIX(P):
  m = len(P)
  π[0] = 0
  k = 0
  for q = 1 to m - 1:
    while k > 0 and P[k] != P[q]:
      k = π[k - 1]
    if P[k] == P[q]:
      k = k + 1
    π[q] = k
  return π
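The same routine as runnable Python, handy for checking small examples by hand:

```python
def compute_prefix(pattern: str) -> list[int]:
    """pi[q] = length of the longest proper prefix of P[0..q] that is also a suffix."""
    m = len(pattern)
    pi = [0] * m
    k = 0                          # length of the current matched prefix
    for q in range(1, m):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]          # fall back to the next-shorter border
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    return pi
```

`compute_prefix("ABABAC")` reproduces the table above: `[0, 0, 1, 2, 3, 0]`.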

The prefix table encodes the internal structure of the pattern so that no text character is ever re-examined.

05

KMP — Algorithm Walkthrough

The algorithm

KMP-SEARCH(T, P):
  n = len(T), m = len(P)
  π = COMPUTE-PREFIX(P)
  q = 0
  for i = 0 to n - 1:
    while q > 0 and P[q] != T[i]:
      q = π[q - 1]       // fall back
    if P[q] == T[i]:
      q = q + 1
    if q == m:
      report match at i - m + 1
      q = π[q - 1]       // next match
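Both phases as one self-contained Python function (the prefix-table loop is repeated inline so the sketch stands alone):

```python
def kmp_search(text: str, pattern: str) -> list[int]:
    """Find all occurrences of pattern in text in O(n + m)."""
    m = len(pattern)
    # phase 1: build the prefix table
    pi = [0] * m
    k = 0
    for q in range(1, m):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    # phase 2: scan the text; q = number of pattern chars currently matched
    matches, q = [], 0
    for i, c in enumerate(text):
        while q > 0 and pattern[q] != c:
            q = pi[q - 1]          # fall back, never moving i backwards
        if pattern[q] == c:
            q += 1
        if q == m:
            matches.append(i - m + 1)
            q = pi[q - 1]          # keep going for overlapping matches
    return matches
```

On the trace example below, `kmp_search("ABABABACAB", "ABABAC")` returns `[2]`.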

Trace example

Text:    A B A B A B A C A B
Pattern: A B A B A C
π:       0 0 1 2 3 0

i=5: q=5 (matched ABABA), mismatch P[5]='C' vs T[5]='B'
     q = π[4] = 3, resume comparing P[3] with T[5]
     → T[0..2] never re-examined

Complexity

Metric         Value
Preprocessing  O(m) — build prefix table
Matching       O(n) — each character examined at most twice
Total          O(n + m)
Space          O(m) — prefix table

KMP performs at most 2n character comparisons in the worst case and never moves the text pointer backward — which makes it ideal for streaming input.

06

Rabin-Karp — Rolling Hash

Idea

Instead of comparing characters, compute a hash of each m-length window and compare it to the hash of the pattern. A rolling hash updates in O(1) per shift.

Rolling hash (polynomial)

hash(S[i..i+m-1]) =
  ( Σ_{j=0..m-1} S[i+j] · d^(m-1-j) ) mod q

d = |Σ| (alphabet size)
q = large prime

Sliding the window

hash(S[i+1..i+m]) =
  ( (hash(S[i..i+m-1]) - S[i] · d^(m-1)) · d
    + S[i+m] ) mod q

Algorithm sketch

RABIN-KARP(T, P):
  hp = hash(P[0..m-1])
  ht = hash(T[0..m-1])
  for i = 0 to n - m:
    if ht == hp:
      if T[i..i+m-1] == P:  // verify
        report match at i
    if i < n - m:
      ht = roll(ht, T[i], T[i+m])
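A concrete version of the sketch, assuming base d = 256 (byte alphabet) and a large prime modulus — both tunable:

```python
def rabin_karp(text: str, pattern: str,
               d: int = 256, q: int = 1_000_000_007) -> list[int]:
    """Rolling-hash search; every hash hit is verified to rule out spurious matches."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    h = pow(d, m - 1, q)               # d^(m-1) mod q, weight of the leading char
    hp = ht = 0
    for j in range(m):                 # initial hashes of pattern and first window
        hp = (hp * d + ord(pattern[j])) % q
        ht = (ht * d + ord(text[j])) % q
    matches = []
    for i in range(n - m + 1):
        if ht == hp and text[i:i + m] == pattern:   # verify: hashes can collide
            matches.append(i)
        if i < n - m:                  # roll the window one position right
            ht = ((ht - ord(text[i]) * h) * d + ord(text[i + m])) % q
    return matches
```

With the prime modulus, collisions are rare; the verification step makes the output exact regardless.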

Key strength

Natural extension to multi-pattern search — hash each pattern, store in a set, check every window against the set.

07

Rabin-Karp — Spurious Hits & Multi-Pattern

Spurious hits

Two different strings can have the same hash (collision). Each collision triggers a full O(m) character comparison. With a good prime q, the expected number of spurious hits is O(n/q).

Complexity

Case            Time
Best / average  O(n + m)
Worst case      O(n · m) — many collisions
Space           O(1) beyond input

Multi-pattern variant

To search for k patterns simultaneously:

  1. Hash all k patterns and store in a hash set
  2. Slide window over text; check each window hash against the set
  3. Expected time: O(n + km)

Practical tips

  • Use double hashing (two independent hash functions) to reduce false positives to near zero
  • Rabin-Karp is the basis of many plagiarism detectors (compare document fingerprints)
  • Works well when pattern length is fixed and known

Rabin-Karp trades deterministic guarantees for practical speed and multi-pattern flexibility.

08

Boyer-Moore — Bad Character Rule

Why Boyer-Moore?

Often the fastest single-pattern algorithm in practice. Scans the pattern right to left and uses two heuristics to skip large portions of text.

Bad character rule

When a mismatch occurs at text character c and pattern position j:

  • If c does not appear in the pattern → shift the pattern past the mismatch entirely
  • If the rightmost occurrence of c in P[0..j-1] is at position k → shift so that occurrence lines up with c in the text
  • If c occurs only to the right of j, the rule would shift the pattern backwards — fall back to a shift of 1

Example

Text:    ...X E R C I S E...
Pattern:    E X A M P L E
            ←───── comparing right to left
P[6]='E' matches T[6]='E'
Mismatch: P[5]='L' vs T[5]='S'
'S' not in pattern
  → shift pattern past T[5]: 6 positions right

Bad character table

BUILD-BAD-CHAR(P):
  for each c in Σ: bc[c] = -1
  for j = 0 to m - 1:
    bc[P[j]] = j
  return bc
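The bad-character idea alone already gives a practical matcher. The sketch below is the Horspool simplification of Boyer-Moore, which keys every shift on the text character aligned with the pattern's last position:

```python
def horspool_search(text: str, pattern: str) -> list[int]:
    """Boyer-Moore-Horspool: bad-character shifts only, no good-suffix rule."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # shift[c] = distance from c's rightmost occurrence (excluding the last
    # pattern position) to the end of the pattern
    shift = {c: m - 1 - j for j, c in enumerate(pattern[:-1])}
    matches, i = [], 0
    while i <= n - m:
        if text[i:i + m] == pattern:        # conceptually compared right to left
            matches.append(i)
        # shift on the text char under the pattern's last position;
        # chars absent from the pattern allow a full-length jump
        i += shift.get(text[i + m - 1], m)
    return matches
```

`horspool_search("here is a simple example", "example")` finds the match at position 17 after only a handful of window inspections.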

The bad character rule alone gives sub-linear average performance on large alphabets.

09

Boyer-Moore — Good Suffix Rule

Good suffix rule

When characters P[j+1..m-1] have matched but P[j] mismatches:

  • Find the rightmost occurrence of the matched suffix elsewhere in the pattern
  • If no full occurrence exists, find the longest prefix of the pattern that matches a suffix of the matched portion
  • Shift accordingly

Combined shift

shift = max(bad_char_shift,
            good_suffix_shift)

Boyer-Moore always takes the maximum of both heuristics — the larger the skip, the better.

Complexity

Metric         Value
Preprocessing  O(m + |Σ|)
Best case      O(n/m) — sub-linear!
Worst case     O(n · m); O(n) with Galil's rule
Average        O(n/m) for large alphabets

In practice

  • Used by GNU grep for fixed-string (literal) searches
  • Faster than KMP for long patterns on natural-language text
  • The O(n/m) best case means longer patterns can actually be faster to find

Boyer-Moore is the rare algorithm where increasing the pattern length improves performance.

10

Z-Algorithm

Z-array definition

For a string S, Z[i] = length of the longest substring starting at S[i] that matches a prefix of S.

S:  a a b x a a b
Z:  - 1 0 0 3 1 0

Algorithm — O(n)

Maintain a Z-box [l, r] representing the rightmost interval matching a prefix. For each position i:

  • If i > r, compute Z[i] naively and update [l, r]
  • If i ≤ r, use previously computed values to initialise Z[i], then extend if needed

String matching with Z-algorithm

Concatenate P$T (where $ is not in the alphabet) and compute the Z-array. Any position i where Z[i] == m is a match.

P = "aab", T = "aabxaab"
S = "aab$aabxaab"
Z = [- 1 0 0 3 1 0 0 3 1 0]
  matches at i=4 and i=8
  → text positions 0 and 4
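Both steps — Z-array construction and the concatenation trick — fit in a short Python sketch (a NUL byte stands in for the separator $, assuming it occurs in neither string):

```python
def z_array(s: str) -> list[int]:
    """Z[i] = length of the longest substring at i matching a prefix of s (Z[0] unused)."""
    n = len(s)
    z = [0] * n
    l = r = 0                          # current Z-box: s[l:r] matches a prefix
    for i in range(1, n):
        if i < r:                      # inside the box: reuse an earlier value
            z[i] = min(r - i, z[i - l])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1                  # extend past the box by direct comparison
        if i + z[i] > r:               # new rightmost box
            l, r = i, i + z[i]
    return z

def z_search(text: str, pattern: str, sep: str = "\x00") -> list[int]:
    """Match by computing Z over pattern + sep + text."""
    m = len(pattern)
    z = z_array(pattern + sep + text)
    return [i - m - 1 for i in range(m + 1, len(z)) if z[i] == m]
```

`z_search("aabxaab", "aab")` returns `[0, 4]`, matching the worked example above.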

Complexity

Metric  Value
Time    O(n + m)
Space   O(n + m)

The Z-algorithm is conceptually simpler than KMP and equally powerful. It is also the foundation for many suffix-based constructions.

11

Aho-Corasick — Multi-Pattern Matching

Problem

Given k patterns and text of length n, find all occurrences of all patterns in one pass.

Construction

  1. Build a trie from all k patterns
  2. Add failure links — analogous to KMP's failure function; on mismatch, follow the failure link to the longest proper suffix that is also a prefix of some pattern
  3. Add output links — chain patterns that end at the same trie node via suffixes

Example

Patterns: {he, she, his, hers}
Text:     "ushers"

Matches: she (at 1), he (at 2),
         hers (at 2)

Search — O(n + z)

Feed the text through the automaton character by character. At each state, follow output links to report all matching patterns. z = total matches reported.
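A compact dict-based sketch of the whole construction — trie, BFS failure links, inherited outputs, then the single scan. A production automaton would use dense transition tables instead:

```python
from collections import deque

def aho_corasick(patterns: list[str], text: str) -> list[tuple[int, str]]:
    """Report (start, pattern) for every occurrence of every pattern in one pass."""
    # 1. Build the trie: goto[s] maps a character to the next state
    goto, fail, out = [{}], [0], [[]]
    for p in patterns:
        s = 0
        for c in p:
            if c not in goto[s]:
                goto.append({}); fail.append(0); out.append([])
                goto[s][c] = len(goto) - 1
            s = goto[s][c]
        out[s].append(p)
    # 2. Failure links via BFS — KMP's prefix function generalised to a trie
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for c, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and c not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(c, 0)
            out[t] += out[fail[t]]          # inherit output links
    # 3. Scan the text, reporting every pattern that ends at each position
    hits, s = [], 0
    for i, c in enumerate(text):
        while s and c not in goto[s]:
            s = fail[s]
        s = goto[s].get(c, 0)
        for p in out[s]:
            hits.append((i - len(p) + 1, p))
    return hits
```

On the slide's example, `aho_corasick(["he", "she", "his", "hers"], "ushers")` reports `(1, "she")`, `(2, "he")`, and `(2, "hers")`.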

Complexity

Metric  Value
Build   O(m · |Σ|) where m = total pattern length
Search  O(n + z) — linear in text + output
Space   O(m · |Σ|) for the automaton

Used in

Snort IDS · fgrep · virus scanners · ripgrep

Aho-Corasick is the multi-pattern generalisation of KMP.

12

Suffix Arrays

Definition

A suffix array SA for string S of length n is a sorted array of all suffix indices ordered by the lexicographic order of the suffixes they represent.

S = "banana$"
Suffixes sorted:     SA:
$                    6
a$                   5
ana$                 3
anana$               1
banana$              0
na$                  4
nana$                2

Pattern search

Binary search the suffix array. Each comparison is O(m), so search is O(m log n). With an LCP array, this improves to O(m + log n).
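A sketch of both halves — naive O(n² log n) construction plus the two binary searches (fast constructions like SA-IS are considerably more involved):

```python
def suffix_array(s: str) -> list[int]:
    """Naive construction: sort suffix indices by the suffixes themselves."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def sa_search(text: str, sa: list[int], pattern: str) -> list[int]:
    """All occurrences of pattern via two binary searches: O(m log n)."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                               # leftmost suffix >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:                               # leftmost suffix > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])                  # contiguous SA block = all matches
```

`suffix_array("banana$")` reproduces the table above (`[6, 5, 3, 1, 0, 4, 2]`), and `sa_search("banana$", sa, "ana")` returns `[1, 3]`.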

Construction algorithms

Algorithm        Time          Space
Naive sort       O(n² log n)   O(n)
Prefix doubling  O(n log n)    O(n)
SA-IS / DC3      O(n)          O(n)

LCP array

Stores the length of the longest common prefix between consecutive suffixes in sorted order. Enables:

  • Longest repeated substring — max(LCP)
  • Number of distinct substrings — n(n+1)/2 - ΣLCP
  • Faster pattern search — O(m + log n)

Suffix arrays use 4-8x less memory than suffix trees while supporting most of the same queries. Preferred in modern bioinformatics.

13

Suffix Trees

Definition

A suffix tree for string S is a compressed trie of all suffixes of S$. Every internal node (except root) has at least two children. Edge labels are substrings of S.

Key properties

  • Contains n leaves (one per suffix)
  • At most n - 1 internal nodes
  • Built in O(n) time (Ukkonen's algorithm)
  • Every substring of S corresponds to a path from the root

Operations

Operation                   Time
Pattern search              O(m)
Longest repeated substring  O(n)
Longest common substring    O(n + m)
Distinct substrings         O(n)
Shortest unique substring   O(n)

Trade-offs

Advantage

O(m) pattern matching (no log factor), rich query support

Disadvantage

10-20x memory overhead per character; complex to implement. Suffix arrays + LCP have largely replaced suffix trees in practice.

14

Regular Expression Matching

NFA simulation (Thompson)

  • Parse regex into NFA — O(m) states
  • Maintain set of active states
  • For each input character, compute next state set
  • Time: O(n · m)
  • Space: O(m)
  • No pathological backtracking — always linear in n · m

DFA approach

  • Convert NFA to DFA (subset construction)
  • Each character requires exactly one state transition
  • Matching: O(n) — one transition per character
  • Construction: up to O(2^m) states (exponential blowup)
  • Lazy DFA builds states on demand, caching frequent ones

Backtracking engines

  • PCRE, Python re, Java regex
  • O(2^n) worst case
  • Patterns like (a+)+b cause catastrophic backtracking
  • Support backreferences (not regular)
  • Safe alternatives: RE2, Rust regex, Go regexp

Use Thompson NFA or lazy DFA engines for guaranteed linear-time matching. Avoid backtracking engines on untrusted input.

15

Approximate String Matching — Edit Distance

Edit distance (Levenshtein)

Minimum number of single-character operations (insert, delete, substitute) to transform string A into B.

edit_distance("kitten", "sitting") = 3
  kitten → sitten  (sub k→s)
  sitten → sittin  (sub e→i)
  sittin → sitting (insert g)

DP recurrence — O(nm)

DP[i][j] = min(
  DP[i-1][j] + 1,      // delete
  DP[i][j-1] + 1,      // insert
  DP[i-1][j-1] + cost  // substitute
)
// cost = 0 if A[i] == B[j]
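The recurrence in runnable form, keeping only the previous DP row so space drops from O(nm) to O(m):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with a rolling 1-D DP row."""
    prev = list(range(len(b) + 1))       # row 0: distance from "" to b[:j]
    for i, ca in enumerate(a, 1):
        curr = [i]                       # column 0: distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]
```

`edit_distance("kitten", "sitting")` returns 3, matching the worked example above.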

Variants

Variant              Operations
Levenshtein          Insert, delete, substitute
Damerau-Levenshtein  + transposition of adjacent chars
Hamming              Substitution only (equal-length strings)
LCS distance         Insert and delete only

Applications

  • Spell checkers and autocorrect
  • Fuzzy search (find strings within edit distance k)
  • DNA/protein sequence alignment
  • OCR post-correction

Edit distance DP is the foundation of bioinformatics alignment algorithms like Smith-Waterman and Needleman-Wunsch.

16

Longest Common Substring

Problem

Given two strings A (length n) and B (length m), find the longest string that is a contiguous substring of both.

DP approach — O(nm)

DP[i][j] = DP[i-1][j-1] + 1  if A[i]==B[j]
         = 0                   otherwise

Answer = max(DP[i][j]) for all i, j
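The DP in runnable form, tracking where the best cell occurs so the substring itself can be recovered:

```python
def longest_common_substring(a: str, b: str) -> str:
    """O(nm) time, O(m) space; returns one longest common contiguous substring."""
    best_len, best_end = 0, 0            # best_end = end index (exclusive) in a
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1    # extend the diagonal run
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]
```

`longest_common_substring("abcdxyz", "xyzabcd")` returns `"abcd"`.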

Suffix array approach

  1. Concatenate A$B# (unique separators)
  2. Build suffix array and LCP array
  3. Answer = maximum LCP between adjacent suffixes from different strings

Comparison of methods

Method        Time                Space
DP            O(nm)               O(nm) or O(m)
Suffix array  O((n+m) log(n+m))   O(n+m)
Suffix tree   O(n+m)              O(n+m)

Related problems

Longest common subsequence (LCS)

Characters need not be contiguous. O(nm) DP. Used by diff tools.

Shortest common superstring

NP-hard in general. Greedy approximation used in genome assembly.

Longest common substring is a building block for diff tools, plagiarism detection, and DNA sequence comparison.

17

Application — Text Editors & Search

Find & replace

Every text editor uses string matching. The choice of algorithm depends on the use case:

  • Short patterns, interactive: naive or simplified Boyer-Moore
  • Long patterns, large files: full Boyer-Moore — sub-linear scanning
  • Regex search: Thompson NFA or lazy DFA (Vim, VS Code, ripgrep)

Full-text search engines

Elasticsearch, Lucene, and Typesense build inverted indexes mapping terms to document IDs. At query time, they intersect posting lists — multi-pattern matching at scale.

grep and ripgrep

Tool      Algorithm                Key feature
grep -F   Aho-Corasick             Multi-pattern literal search
grep -E   DFA regex                POSIX extended regex
GNU grep  Boyer-Moore + regex      Literal prefix optimisation
ripgrep   Aho-Corasick + lazy DFA  Parallel, .gitignore-aware

ripgrep demonstrates modern string matching: literal optimisations (Aho-Corasick for multiple literals, memchr for single bytes) layered with regex when needed.

18

Application — DNA Sequencing

The alignment problem

DNA sequencing produces millions of short reads (50-300 bp). Each must be aligned to a reference genome (3 billion bp for humans). Speed is critical.

BWT-based aligners

Modern aligners (BWA, Bowtie2) use the Burrows-Wheeler Transform and FM-index — compressed representations of suffix arrays supporting O(m) exact matching with minimal memory.

Approximate alignment

  • Seed-and-extend: find exact short matches, extend with edit-distance DP
  • Smith-Waterman: local alignment with affine gap penalties
  • BLAST: heuristic seed-based search with statistical scoring

Scale

Metric                      Value
Human genome                ~3.2 billion base pairs
Typical sequencing run      100M – 1B reads
Alignment speed (BWA-MEM2)  ~60M reads/hour

Key data structures

FM-index

Compressed suffix array with BWT. Supports exact pattern matching in O(m) time with ~1.5 bytes/character memory.

Wavelet tree

Enables rank and select queries on the BWT in O(log |Σ|) time. Core building block of modern compressed indexes.

Without BWT/FM-index, whole-genome sequencing would be computationally infeasible.

19

Application — Plagiarism Detection

Document fingerprinting

Break documents into overlapping k-grams (k consecutive characters or words), hash each, and compare fingerprint sets between documents.

Winnowing algorithm

Select a subset of k-gram hashes using a sliding window minimum. Guarantees:

  • Any match of length ≥ t is detected
  • Fingerprint density is bounded — not every hash is stored
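A toy sketch of the scheme, using Python's built-in `hash` in place of a rolling hash and leftmost-minimum tie-breaking (the published algorithm specifies the rightmost minimum); k and w here are illustrative choices:

```python
def winnow(text: str, k: int = 5, w: int = 4) -> set[tuple[int, int]]:
    """Return {(position, hash)} fingerprints: the minimum hash of each
    window of w consecutive k-gram hashes.

    Any shared substring of length >= k + w - 1 between two documents
    contributes at least one shared fingerprint hash.
    """
    hashes = [hash(text[i:i + k]) for i in range(len(text) - k + 1)]
    fingerprints = set()
    for i in range(len(hashes) - w + 1):
        window = hashes[i:i + w]
        j = min(range(w), key=lambda x: window[x])   # leftmost minimum
        fingerprints.add((i + j, window[j]))
    return fingerprints
```

Comparing the hash sets of two documents' fingerprints then reveals shared passages without comparing every k-gram.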

MOSS

Stanford's plagiarism detector for code. Uses winnowing with language-aware tokenisation (ignoring variable names, whitespace, comments).

Similarity metrics

Metric       Formula
Jaccard      |A ∩ B| / |A ∪ B|
Containment  |A ∩ B| / |A|
Cosine       A · B / (|A| · |B|) on TF-IDF vectors

Pipeline

  1. Tokenise and normalise
  2. Generate k-gram hashes (Rabin-Karp)
  3. Select fingerprints (winnowing)
  4. Compare fingerprint sets

Rabin-Karp's rolling hash is the engine behind most modern plagiarism detection systems.

20

Complexity Comparison & Summary

Algorithm     Preprocess  Matching    Space       Best for
Naive         —           O(nm)       O(1)        Short patterns, one-off search
KMP           O(m)        O(n)        O(m)        Streaming, worst-case guarantee
Rabin-Karp    O(m)        O(n) avg    O(1)        Multi-pattern, fingerprinting
Boyer-Moore   O(m + |Σ|)  O(n/m) avg  O(m + |Σ|)  Long patterns, large alphabets
Z-Algorithm   O(m)        O(n)        O(n + m)    Simple implementation
Aho-Corasick  O(m · |Σ|)  O(n + z)    O(m · |Σ|)  Many patterns simultaneously
Suffix array  O(n)        O(m log n)  O(n)        Many queries on same text
Suffix tree   O(n)        O(m)        O(n)        Rich substring queries

Key takeaways

  • No single algorithm wins everywhere — choice depends on alphabet size, pattern length, number of patterns, and whether text is static or streaming
  • KMP and Z-algorithm give O(n + m) worst-case guarantees for single-pattern matching
  • Boyer-Moore is fastest in practice for single long patterns on natural text
  • Aho-Corasick is the standard for multi-pattern matching
  • Suffix arrays and trees trade preprocessing for fast repeated queries
  • Approximate matching is fundamentally O(nm) but accelerated with bit-parallelism and filtering
  • Real systems combine algorithms — ripgrep uses Aho-Corasick, memchr, and lazy DFA depending on the query

Recommended reading

CLRS (string matching chapters) · Gusfield (strings, trees, sequences) · Sedgewick & Wayne (practical implementations) · cp-algorithms.com