Given a text T of length n and a pattern P of length m, find all occurrences of P in T.
Formally: given text T[0..n-1] and pattern P[0..m-1], report every shift i where T[i..i+m-1] == P. Every time you press Ctrl+F, run grep, query a database with LIKE, or align a DNA sequence, a string matching algorithm is at work.
| Term | Meaning |
|---|---|
| Text | The string being searched (haystack) |
| Pattern | The string being searched for (needle) |
| Alphabet | The set of valid characters, size \|Σ\| |
| Shift | An alignment position of pattern against text |
| Valid shift | A shift where every character matches |
The naive approach checks every shift. Efficient algorithms skip provably invalid shifts.
For each position i in the text, compare P[0..m-1] against T[i..i+m-1] character by character.
```
NAIVE-STRING-MATCH(T, P):
    n = len(T), m = len(P)
    for i = 0 to n - m:
        if T[i..i+m-1] == P[0..m-1]:
            report match at i
```
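The pseudocode maps directly to Python; a minimal sketch (the function name `naive_match` is ours):

```python
def naive_match(text: str, pattern: str) -> list[int]:
    """Check every shift of the pattern against the text: O(n*m) worst case."""
    n, m = len(text), len(pattern)
    return [i for i in range(n - m + 1) if text[i:i + m] == pattern]

print(naive_match("abracadabra", "abra"))  # [0, 7]
```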
The naive method is adequate for short patterns (m < 10) on moderately sized text.

- Worst case: O(n · m) — e.g. T = "aaaaaa", P = "aaab"
- Best case: O(n) — first character of pattern rarely matches
- Average: O(n) for random text over a large alphabet
- Preprocessing: O(1) — none required
The naive algorithm discards all information from a partial match. Every advanced algorithm exploits that wasted knowledge.
When a mismatch occurs at P[j] after matching P[0..j-1], the naive algorithm restarts from scratch. KMP asks: what is the longest proper prefix of P[0..j-1] that is also a suffix? Resume matching from there.
π[j] = length of the longest proper prefix of P[0..j] that is also a suffix.
Pattern: A B A B A C
Index: 0 1 2 3 4 5
π: 0 0 1 2 3 0
```
COMPUTE-PREFIX(P):
    m = len(P)
    π[0] = 0
    k = 0
    for q = 1 to m - 1:
        while k > 0 and P[k] != P[q]:
            k = π[k - 1]
        if P[k] == P[q]:
            k = k + 1
        π[q] = k
    return π
```
The prefix table encodes the internal structure of the pattern so that no text character is ever re-examined.
```
KMP-SEARCH(T, P):
    n = len(T), m = len(P)
    π = COMPUTE-PREFIX(P)
    q = 0
    for i = 0 to n - 1:
        while q > 0 and P[q] != T[i]:
            q = π[q - 1]              // fall back
        if P[q] == T[i]:
            q = q + 1
        if q == m:
            report match at i - m + 1
            q = π[q - 1]              // continue for overlapping matches
```
```
Text:    A B A B A B A C A B
Pattern: A B A B A C
π:       0 0 1 2 3 0

After matching ABABA, mismatch at i=5: P[5]='C' vs T[5]='B'
q = π[4] = 3 → resume comparing P[3] with T[5]
T[2..4] is never re-examined — the prefix table guarantees it matches P[0..2]
```
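The two routines above translate directly into Python; a minimal sketch (function names are ours):

```python
def compute_prefix(p: str) -> list[int]:
    """pi[q] = length of the longest proper prefix of p[0..q] that is a suffix."""
    pi = [0] * len(p)
    k = 0
    for q in range(1, len(p)):
        while k > 0 and p[k] != p[q]:
            k = pi[k - 1]              # fall back to the next shorter border
        if p[k] == p[q]:
            k += 1
        pi[q] = k
    return pi

def kmp_search(t: str, p: str) -> list[int]:
    pi, q, out = compute_prefix(p), 0, []
    for i, c in enumerate(t):
        while q > 0 and p[q] != c:
            q = pi[q - 1]
        if p[q] == c:
            q += 1
        if q == len(p):
            out.append(i - len(p) + 1)
            q = pi[q - 1]              # continue for overlapping matches
    return out

print(compute_prefix("ABABAC"))            # [0, 0, 1, 2, 3, 0]
print(kmp_search("ABABABACAB", "ABABAC"))  # [2]
```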
| Metric | Value |
|---|---|
| Preprocessing | O(m) — build prefix table |
| Matching | O(n) — each character examined at most twice |
| Total | O(n + m) |
| Space | O(m) — prefix table |
KMP guarantees O(n) character comparisons in the worst case — asymptotically optimal, since any correct single-pattern matcher must examine Ω(n) characters in the worst case.
Instead of comparing characters, compute a hash of each m-length window and compare it to the hash of the pattern. A rolling hash updates in O(1) per shift.
```
hash(S[i..i+m-1]) = ( Σ_{j=0}^{m-1} S[i+j] · d^(m-1-j) ) mod q

d = |Σ| (alphabet size), q = a large prime

Rolling update, O(1) per shift:
hash(S[i+1..i+m]) = ( (hash(S[i..i+m-1]) - S[i] · d^(m-1)) · d + S[i+m] ) mod q
```
```
RABIN-KARP(T, P):
    n = len(T), m = len(P)
    hp = hash(P[0..m-1])
    ht = hash(T[0..m-1])
    for i = 0 to n - m:
        if ht == hp:
            if T[i..i+m-1] == P:       // verify: rule out spurious hits
                report match at i
        if i < n - m:                  // roll only while a next window exists
            ht = roll(ht, T[i], T[i+m])
```
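A runnable Python sketch of the same idea; the defaults d = 256 and q = 10^9 + 7 are illustrative choices, not prescribed by the algorithm:

```python
def rabin_karp(t: str, p: str, d: int = 256, q: int = 10**9 + 7) -> list[int]:
    n, m = len(t), len(p)
    if m > n:
        return []
    h = pow(d, m - 1, q)              # d^(m-1) mod q, used in the rolling update
    hp = ht = 0
    for j in range(m):                # initial hashes of pattern and first window
        hp = (hp * d + ord(p[j])) % q
        ht = (ht * d + ord(t[j])) % q
    out = []
    for i in range(n - m + 1):
        if ht == hp and t[i:i + m] == p:   # verify to rule out spurious hits
            out.append(i)
        if i < n - m:                      # roll the window one character right
            ht = ((ht - ord(t[i]) * h) * d + ord(t[i + m])) % q
    return out

print(rabin_karp("abracadabra", "cada"))  # [4]
```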
Natural extension to multi-pattern search — hash each pattern, store in a set, check every window against the set.
Two different strings can have the same hash (collision). Each collision triggers a full O(m) character comparison. With a good prime q, the expected number of spurious hits is O(n/q).
| Case | Time |
|---|---|
| Best / average | O(n + m) |
| Worst case | O(n · m) — many collisions |
| Space | O(1) beyond input |
To search for k patterns simultaneously:
- Hash all k patterns (of equal length m) and store the hashes in a set
- Compare each window's rolling hash against the set, verifying on hits

Expected total: O(n + km). Rabin-Karp trades deterministic guarantees for practical speed and multi-pattern flexibility.
Often the fastest single-pattern algorithm in practice. Scans the pattern right to left and uses two heuristics to skip large portions of text.
When a mismatch occurs at text character c and pattern position j:
- If c does not appear in the pattern → shift the pattern past the mismatch entirely
- If c appears at position k < j → shift so that occurrence aligns with the text

```
Text:    ... X E R C I S E ...
Pattern:     E X A M P L E
             ←── comparing right to left

P[6]='E' matches, then mismatch: text 'S' vs P[5]='L'
'S' does not occur in the pattern
→ shift the pattern 6 positions right
```
```
BUILD-BAD-CHAR(P):
    for each c in Σ: bc[c] = -1
    for j = 0 to m - 1:
        bc[P[j]] = j          // rightmost occurrence wins
    return bc
```
The bad character rule alone gives sub-linear average performance on large alphabets.
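A sketch of a search driven by the bad-character idea alone, in Horspool's simplified form (shift by the last occurrence of the text character under the final pattern position); the function name and example are ours:

```python
def horspool(t: str, p: str) -> list[int]:
    n, m = len(t), len(p)
    # shift[c] = distance from the last occurrence of c in p[0..m-2] to the end
    shift = {c: m - 1 - j for j, c in enumerate(p[:-1])}
    out, i = [], 0
    while i <= n - m:
        j = m - 1
        while j >= 0 and p[j] == t[i + j]:   # compare right to left
            j -= 1
        if j < 0:
            out.append(i)
        i += shift.get(t[i + m - 1], m)      # chars absent from p skip m
    return out

print(horspool("HERE IS A SIMPLE EXAMPLE", "EXAMPLE"))  # [17]
```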
When characters P[j+1..m-1] have matched but P[j] mismatches:
```
shift = max(bad_char_shift, good_suffix_shift)
```
Boyer-Moore always takes the maximum of both heuristics — the larger the skip, the better.
| Metric | Value |
|---|---|
| Preprocessing | O(m + \|Σ\|) |
| Best case | O(n/m) — sub-linear! |
| Worst case | O(n · m) — O(n) with Galil |
| Average | O(n/m) for large alphabets |
- Used by grep and most editors for literal search
- O(n/m) best case means longer patterns can actually be faster to find

Boyer-Moore is the rare algorithm where increasing the pattern length improves performance.
For a string S, Z[i] = length of the longest substring starting at S[i] that matches a prefix of S.
S: a a b x a a b
Z: - 1 0 0 3 1 0
Maintain a Z-box [l, r] representing the rightmost interval matching a prefix. For each position i:
- If i > r, compute Z[i] naively and update [l, r]
- If i ≤ r, initialise Z[i] = min(r − i, Z[i − l]) from previously computed values, then extend if needed

To match, concatenate P$T (where $ is not in the alphabet) and compute the Z-array. Any position i where Z[i] == m is a match.
P = "aab", T = "aabxaab"
S = "aab$aabxaab"
Z = [- 1 0 0 3 1 0 0 3 1 0]
matches at i=4 and i=8
→ text positions 0 and 4
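A Python sketch of both steps (here Z[0] is set to n rather than left undefined as in the tables above; function names are ours):

```python
def z_array(s: str) -> list[int]:
    n = len(s)
    z = [0] * n
    z[0] = n                      # convention: the whole string matches itself
    l = r = 0                     # current Z-box [l, r)
    for i in range(1, n):
        if i < r:                 # inside the Z-box: reuse an earlier value
            z[i] = min(r - i, z[i - l])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1             # extend the match explicitly
        if i + z[i] > r:
            l, r = i, i + z[i]
    return z

def z_search(t: str, p: str) -> list[int]:
    s = p + "$" + t               # assumes '$' occurs in neither string
    m = len(p)
    return [i - m - 1 for i, zi in enumerate(z_array(s)) if zi == m]

print(z_search("aabxaab", "aab"))  # [0, 4]
```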
| Metric | Value |
|---|---|
| Time | O(n + m) |
| Space | O(n + m) |
The Z-algorithm is conceptually simpler than KMP and equally powerful. It is also the foundation for many suffix-based constructions.
Given k patterns and text of length n, find all occurrences of all patterns in one pass.
Build a trie of the k patterns, augmented with failure links (the multi-pattern analogue of KMP's prefix function) and output links.

Patterns: {he, she, his, hers}
Text: "ushers"
Matches: she (at 1), he (at 2),
hers (at 2)
Feed the text through the automaton character by character. At each state, follow output links to report all matching patterns. z = total matches reported.
| Metric | Value |
|---|---|
| Build | O(m · \|Σ\|) where m = total pattern length |
| Search | O(n + z) — linear in text plus output |
| Space | O(m · \|Σ\|) for the automaton |
Aho-Corasick is the multi-pattern generalisation of KMP.
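A compact Python sketch of the automaton (dict-based trie rather than a dense \|Σ\|-wide transition table; function names are ours):

```python
from collections import deque

def build_automaton(patterns: list[str]):
    goto, fail, out = [{}], [0], [[]]          # one entry per trie node
    for p in patterns:                          # phase 1: build the trie
        node = 0
        for c in p:
            if c not in goto[node]:
                goto.append({}); fail.append(0); out.append([])
                goto[node][c] = len(goto) - 1
            node = goto[node][c]
        out[node].append(p)
    queue = deque(goto[0].values())             # phase 2: BFS for failure links
    while queue:
        node = queue.popleft()
        for c, nxt in goto[node].items():
            queue.append(nxt)
            f = fail[node]
            while f and c not in goto[f]:
                f = fail[f]                     # follow fails, as in KMP
            fail[nxt] = goto[f].get(c, 0)
            out[nxt] += out[fail[nxt]]          # inherit matches (output links)
    return goto, fail, out

def ac_search(text: str, patterns: list[str]) -> list[tuple[int, str]]:
    goto, fail, out = build_automaton(patterns)
    node, hits = 0, []
    for i, c in enumerate(text):
        while node and c not in goto[node]:
            node = fail[node]
        node = goto[node].get(c, 0)
        for p in out[node]:
            hits.append((i - len(p) + 1, p))    # (start position, pattern)
    return hits

print(ac_search("ushers", ["he", "she", "his", "hers"]))
# [(1, 'she'), (2, 'he'), (2, 'hers')]
```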
A suffix array SA for string S of length n is a sorted array of all suffix indices ordered by the lexicographic order of the suffixes they represent.
S = "banana$"
Suffixes sorted: SA:
$ 6
a$ 5
ana$ 3
anana$ 1
banana$ 0
na$ 4
nana$ 2
Binary search the suffix array. Each comparison is O(m), so search is O(m log n). With an LCP array, this improves to O(m + log n).
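A small Python sketch of both steps — the construction is the naive O(n² log n) sort (fine for demos; use prefix doubling or SA-IS for large texts), and the search exploits the fact that suffixes sharing prefix P form a contiguous block of the array:

```python
from bisect import bisect_left, bisect_right

def suffix_array(s: str) -> list[int]:
    # Naive construction: sort suffix indices by the suffixes themselves.
    return sorted(range(len(s)), key=lambda i: s[i:])

def sa_search(s: str, sa: list[int], p: str) -> list[int]:
    # Truncating each suffix to len(p) chars preserves the sorted order,
    # so the block of suffixes starting with p can be found by bisection.
    keys = [s[i:i + len(p)] for i in sa]
    lo, hi = bisect_left(keys, p), bisect_right(keys, p)
    return sorted(sa[lo:hi])

s = "banana$"
sa = suffix_array(s)
print(sa)                       # [6, 5, 3, 1, 0, 4, 2]
print(sa_search(s, sa, "ana"))  # [1, 3]
```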
| Algorithm | Time | Space |
|---|---|---|
| Naive sort | O(n² log n) | O(n) |
| Prefix doubling | O(n log n) | O(n) |
| SA-IS / DC3 | O(n) | O(n) |
Stores the length of the longest common prefix between consecutive suffixes in sorted order. Enables:
- Longest repeated substring: max(LCP)
- Number of distinct substrings: n(n+1)/2 − Σ LCP
- Pattern search accelerated to O(m + log n)

Suffix arrays use 4-8x less memory than suffix trees while supporting most of the same queries, and are preferred in modern bioinformatics.
A suffix tree for string S is a compressed trie of all suffixes of S$. Every internal node (except root) has at least two children. Edge labels are substrings of S.
- n leaves (one per suffix)
- At most n − 1 internal nodes
- Built in O(n) time (Ukkonen's algorithm)
- Every substring of S corresponds to a path from the root

| Operation | Time |
|---|---|
| Pattern search | O(m) |
| Longest repeated substring | O(n) |
| Longest common substring | O(n + m) |
| Distinct substrings | O(n) |
| Shortest unique substring | O(n) |
Pros: O(m) pattern matching (no log factor) and rich query support.
Cons: 10-20x memory overhead per character and implementation complexity. Suffix arrays + LCP have largely replaced suffix trees in practice.
| Engine | States | Matching | Used by |
|---|---|---|---|
| Backtracking | O(m) | O(2^n) worst case — patterns like (a+)+b cause catastrophic backtracking | Python re, Java regex |
| Thompson NFA | O(m) | O(n · m) — simulate all active states | Rust regex, Go regexp |
| DFA | O(2^m) (exponential blowup) | O(n) — one transition per character | lazy-DFA hybrids |

Use Thompson NFA or lazy DFA engines for guaranteed linear-time matching. Avoid backtracking engines on untrusted input.
Minimum number of single-character operations (insert, delete, substitute) to transform string A into B.
edit_distance("kitten", "sitting") = 3
kitten → sitten (sub k→s)
sitten → sittin (sub e→i)
sittin → sitting (insert g)
```
DP[i][j] = min(
    DP[i-1][j] + 1,         // delete
    DP[i][j-1] + 1,         // insert
    DP[i-1][j-1] + cost     // substitute
)
// cost = 0 if A[i] == B[j], else 1
```
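The recurrence only ever reads the previous row, so O(m) space suffices; a Python sketch (function name is ours):

```python
def edit_distance(a: str, b: str) -> int:
    n, m = len(a), len(b)
    prev = list(range(m + 1))          # DP row for i = 0: j insertions
    for i in range(1, n + 1):
        cur = [i] + [0] * m            # DP[i][0] = i deletions
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1,          # delete a[i-1]
                         cur[j - 1] + 1,       # insert b[j-1]
                         prev[j - 1] + cost)   # substitute
        prev = cur
    return prev[m]

print(edit_distance("kitten", "sitting"))  # 3
```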
| Variant | Operations |
|---|---|
| Levenshtein | Insert, delete, substitute |
| Damerau-Levenshtein | + transposition of adjacent chars |
| Hamming | Substitution only (equal-length) |
| LCS distance | Insert and delete only |
When the distance is known to be at most k, a banded DP over a diagonal strip runs in O(n · k). Edit distance DP is the foundation of bioinformatics alignment algorithms like Smith-Waterman and Needleman-Wunsch.
Given two strings A (length n) and B (length m), find the longest string that is a contiguous substring of both.
```
DP[i][j] = DP[i-1][j-1] + 1   if A[i] == B[j]
         = 0                  otherwise

Answer = max(DP[i][j]) over all i, j
```
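Since each cell depends only on the previous row, the DP fits in O(m) space; a Python sketch that also recovers the substring itself (function name is ours):

```python
def longest_common_substring(a: str, b: str) -> str:
    best_len, best_end = 0, 0          # length and end index (in a) of the best run
    prev = [0] * (len(b) + 1)          # one DP row
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1       # extend the diagonal run
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]

print(longest_common_substring("xabcdy", "zabcq"))  # "abc"
```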
Alternatively, build a generalised suffix structure over the concatenation A$B# (unique separators) and find the deepest node with suffixes from both strings.

| Method | Time | Space |
|---|---|---|
| DP | O(nm) | O(nm) or O(m) |
| Suffix array | O((n+m) log(n+m)) | O(n+m) |
| Suffix tree | O(n+m) | O(n+m) |
Longest common subsequence: characters need not be contiguous. O(nm) DP. Used by diff tools.
Shortest common superstring: NP-hard in general. A greedy approximation is used in genome assembly.
Longest common substring is a building block for diff tools, plagiarism detection, and DNA sequence comparison.
Every text editor uses string matching; the choice of algorithm depends on the use case.
Elasticsearch, Lucene, and Typesense build inverted indexes mapping terms to document IDs. At query time, they intersect posting lists — multi-pattern matching at scale.
| Tool | Algorithm | Key feature |
|---|---|---|
| grep -F | Aho-Corasick | Multi-pattern literal |
| grep -E | DFA regex | POSIX extended regex |
| GNU grep | Boyer-Moore + regex | Literal prefix optimisation |
| ripgrep | Aho-Corasick + lazy DFA | Parallel, .gitignore-aware |
ripgrep demonstrates modern string matching: literal optimisations (Aho-Corasick for multiple literals, memchr for single bytes) layered with regex when needed.
DNA sequencing produces millions of short reads (50-300 bp). Each must be aligned to a reference genome (3 billion bp for humans). Speed is critical.
Modern aligners (BWA, Bowtie2) use the Burrows-Wheeler Transform and FM-index — compressed representations of suffix arrays supporting O(m) exact matching with minimal memory.
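The BWT itself is simple to state: sort all rotations of the text (with a sentinel appended) and take the last column. A naive O(n² log n) Python sketch — real indexes build it via the suffix array instead:

```python
def bwt(s: str) -> str:
    """Burrows-Wheeler Transform: last column of the sorted rotation matrix."""
    s += "$"                                            # sentinel, assumed absent from s
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

print(bwt("banana"))  # "annb$aa" — equal characters cluster, aiding compression
```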
| Metric | Value |
|---|---|
| Human genome | ~3.2 billion base pairs |
| Typical sequencing run | 100M – 1B reads |
| Alignment speed (BWA-MEM2) | ~60M reads/hour |
FM-index: a compressed suffix array built on the BWT. Supports exact pattern matching in O(m) time with ~1.5 bytes/character of memory.
Wavelet trees enable rank and select queries on the BWT in O(log \|Σ\|) time — a core building block of modern compressed indexes.
Without BWT/FM-index, whole-genome sequencing would be computationally infeasible.
Break documents into overlapping k-grams (k consecutive characters or words), hash each, and compare fingerprint sets between documents.
Select a subset of k-gram hashes using a sliding window minimum. Guarantees:
- Any shared substring at least as long as the guarantee threshold t is detected
- Nothing shorter than the noise threshold k is reported

MOSS, Stanford's plagiarism detector for code, uses winnowing with language-aware tokenisation (ignoring variable names, whitespace, and comments).
| Metric | Formula |
|---|---|
| Jaccard | \|A ∩ B\| / \|A ∪ B\| |
| Containment | \|A ∩ B\| / \|A\| |
| Cosine | A · B / (\|A\| · \|B\|) on TF-IDF |
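A toy Python sketch of the pipeline — k-gram hashing, a simplified winnowing step (keep the minimum of each window of w hashes), and Jaccard similarity over the resulting fingerprint sets. The parameter values k = 5, w = 4 and the function names are ours:

```python
def fingerprints(doc: str, k: int = 5, w: int = 4) -> set[int]:
    """Hash every k-gram, then keep the minimum hash of each window of w hashes."""
    grams = [hash(doc[i:i + k]) for i in range(len(doc) - k + 1)]
    return {min(grams[i:i + w]) for i in range(len(grams) - w + 1)}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

original = "the quick brown fox jumps over the lazy dog"
edited = "a quick brown fox jumps over one lazy dog"
print(jaccard(fingerprints(original), fingerprints(edited)))  # well above 0
```

Any shared run of length at least w + k − 1 characters contributes an identical window of hashes to both documents, so its minimum appears in both fingerprint sets — this is the winnowing detection guarantee in miniature.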
Rabin-Karp's rolling hash is the engine behind most modern plagiarism detection systems.
| Algorithm | Preprocess | Matching | Space | Best for |
|---|---|---|---|---|
| Naive | O(1) | O(nm) | O(1) | Short patterns, one-off search |
| KMP | O(m) | O(n) | O(m) | Streaming, worst-case guarantee |
| Rabin-Karp | O(m) | O(n) avg | O(1) | Multi-pattern, fingerprinting |
| Boyer-Moore | O(m + \|Σ\|) | O(n/m) avg | O(m + \|Σ\|) | Long patterns, large alphabets |
| Z-Algorithm | O(m) | O(n) | O(n+m) | Simple implementation |
| Aho-Corasick | O(m · \|Σ\|) | O(n+z) | O(m · \|Σ\|) | Many patterns simultaneously |
| Suffix Array | O(n) | O(m log n) | O(n) | Many queries on same text |
| Suffix Tree | O(n) | O(m) | O(n) | Rich substring queries |
- Exact matching: O(n + m) worst-case guarantees for single-pattern matching
- Approximate matching: O(nm) DP, accelerated in practice with bit-parallelism and filtering

Further reading: CLRS (string matching chapters) · Gusfield (strings, trees, sequences) · Sedgewick & Wayne (practical implementations) · cp-algorithms.com