Main memory is fast but limited. Databases must store far more data than fits in RAM, so most data lives on disk (SSD or HDD).
The dominant cost in disk-based data structures is the number of disk page reads and writes, not CPU comparisons. An ideal index structure therefore minimises the number of disk I/Os per operation; this is the fundamental insight behind B-trees.
A B-tree of order m (also called minimum degree t where m = 2t) satisfies these properties (Knuth, 1973):
| Property | Description |
|---|---|
| Root | Has between 1 and m − 1 keys; it is exempt from the minimum-children rule (it may be a leaf, otherwise it has at least 2 children) |
| Internal nodes | Each non-root internal node has between ⌈m/2⌉ and m children |
| Keys per node | Each node with k children stores exactly k − 1 keys |
| Sorted keys | Keys within each node are in ascending order |
| Balanced | All leaves appear on the same level |
| Subtree ordering | For key Ki, all keys in the left subtree < Ki and all keys in the right subtree > Ki |
Key insight: A B-tree of order 1000 holding 1 billion keys has a height of at most 4, and typically 3. Three or four disk reads reach any key among a billion.
Each node occupies one disk page and contains interleaved keys and child pointers:
+----+----+----+----+----+----+----+----+----+
| P0 | K1 | P1 | K2 | P2 | K3 | P3 |... | Pn |
+----+----+----+----+----+----+----+----+----+
Ki = key i (sorted)
Pi = pointer to child i (disk page address)
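To make the layout concrete, here is a minimal in-memory sketch of a node in Python; the class name and fields are illustrative, and on disk each instance would be serialised into one page. Later sketches in this section reuse this class.

```python
class BTreeNode:
    """Illustrative in-memory node; one instance corresponds to one disk page."""
    def __init__(self, t, is_leaf=True):
        self.t = t                # minimum degree (m = 2t)
        self.is_leaf = is_leaf
        self.keys = []            # up to 2t - 1 keys, kept sorted
        self.children = []        # up to 2t child nodes ("page pointers")
```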
| Parameter | Typical Value |
|---|---|
| Page size | 8 KB – 16 KB |
| Key size | 8 – 64 bytes |
| Pointer size | 6 – 8 bytes |
| Keys per node | 100 – 500+ |
| Fanout | 101 – 501+ |
High fanout is the key advantage. A binary tree with 1 million keys has height ~20. A B-tree with fanout 500 has height ~3.
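A quick back-of-envelope check of those heights (illustrative numbers only):

```python
import math

n = 1_000_000
print(math.ceil(math.log(n, 2)))    # binary tree: ~20 levels
print(math.ceil(math.log(n, 500)))  # B-tree with fanout 500: ~3 levels
```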
BTREE-SEARCH(node, key):
    i = 0
    while i < node.num_keys and key > node.keys[i]:
        i += 1
    if i < node.num_keys and key == node.keys[i]:
        return (node, i)                 // found
    if node.is_leaf:
        return NOT_FOUND
    DISK-READ(node.children[i])          // 1 I/O per level descended
    return BTREE-SEARCH(node.children[i], key)
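The same search as a runnable Python sketch against the BTreeNode class above; page reads are only counted, not performed, since everything here is in memory.

```python
def btree_search(node, key, reads=1):
    """Return ((node, index), pages_read) if found, else (None, pages_read)."""
    i = 0
    while i < len(node.keys) and key > node.keys[i]:
        i += 1
    if i < len(node.keys) and key == node.keys[i]:
        return (node, i), reads
    if node.is_leaf:
        return None, reads
    # In a disk-based tree this is where DISK-READ would load the child page.
    return btree_search(node.children[i], key, reads + 1)
```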
Instead of splitting bottom-up after overflow, split any full node encountered on the way down during the search:
Proactive splitting is the standard approach. It avoids the complexity of bottom-up split propagation and simplifies concurrent access.
Insert 25 into a full leaf:
Before: [10 | 20 | 30 | 40] ← full
Step 1: Split at median (30)
          [30]   ← promoted
         /    \
   [10|20]    [40]
Step 2: Insert 25
          [30]
         /    \
[10|20|25]    [40]   ← done
Root splits: When the root is full and must split, a new root is created with the promoted median. This is the only way a B-tree grows taller.
SPLIT-CHILD(parent, i):
    full = parent.children[i]            // full node: m - 1 = 2t - 1 keys (0-indexed)
    median = full.keys[t-1]
    new = allocate_page()
    // Right half (the t - 1 keys after the median) → new node
    new.keys = full.keys[t .. m-2]       // inclusive indices
    new.children = full.children[t .. m-1]
    // Left half stays in place
    full.num_keys = t - 1
    // Promote the median to the parent
    insert median at parent.keys[i]
    parent.children[i+1] = new
    DISK-WRITE(full)                     // 3 writes per split
    DISK-WRITE(new)
    DISK-WRITE(parent)
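A Python sketch of the whole proactive insert, pairing SPLIT-CHILD with a top-down descent over the BTreeNode class from earlier. It is in-memory only, so the DISK-WRITE calls are omitted, and the function names are illustrative.

```python
import bisect

def split_child(parent, i, t):
    """Split the full child parent.children[i]; promote its median key."""
    full = parent.children[i]                  # has 2t - 1 keys
    new = BTreeNode(t, is_leaf=full.is_leaf)
    median = full.keys[t - 1]
    new.keys = full.keys[t:]                   # right half of the keys
    full.keys = full.keys[:t - 1]              # left half stays behind
    if not full.is_leaf:
        new.children = full.children[t:]
        full.children = full.children[:t]
    parent.keys.insert(i, median)              # median moves up
    parent.children.insert(i + 1, new)

def btree_insert(root, key, t):
    """Insert key, splitting any full node met on the way down; returns the root."""
    if len(root.keys) == 2 * t - 1:            # full root: tree grows taller
        new_root = BTreeNode(t, is_leaf=False)
        new_root.children.append(root)
        split_child(new_root, 0, t)
        root = new_root
    node = root
    while not node.is_leaf:
        i = 0
        while i < len(node.keys) and key > node.keys[i]:
            i += 1
        if len(node.children[i].keys) == 2 * t - 1:   # full child: split first
            split_child(node, i, t)
            if key > node.keys[i]:                    # descend right of new median
                i += 1
        node = node.children[i]
    bisect.insort(node.keys, key)                     # leaf is guaranteed non-full
    return root
```

Because every child is made non-full before the descent enters it, the final leaf always has room and no backtracking is needed.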
| Operation | Count |
|---|---|
| Disk reads (search path) | O(log_t n) |
| Disk writes (per split) | 3 |
| Splits per insertion (worst) | O(log_t n) |
| Amortised splits | O(1) |
In practice, splits are rare. Most insertions just add a key to a non-full leaf — one disk write.
| Case | Situation | Action |
|---|---|---|
| 1 | Key found in a leaf node | Remove the key directly |
| 2 | Key found in an internal node | Replace with predecessor (or successor) from a leaf, then delete from leaf |
| 3 | Key not in current node | Ensure child has ≥ t keys before recursing (fix underflow proactively) |
When descending to a child that has only t − 1 keys (minimum), we must ensure it has at least t keys before entering it:
If an adjacent sibling has ≥ t keys, rotate a key through the parent into the deficient child.
If both adjacent siblings have exactly t − 1 keys, merge the child with a sibling and pull the separator key down from the parent.
Like insertion, deletion uses proactive fixing: we ensure each node on the path down has enough keys so we never need to backtrack (a code sketch follows the merge example below).
Before (right sibling has spare keys):
Parent: [...| 30 |...]
/ \
Child: [10|20] [35|40|50|60]
After rotation:
Parent: [...| 35 |...]
/ \
Child: [10|20|30] [40|50|60]
Key 30 demoted, key 35 promoted.
Before (both have t-1 keys):
Parent: [...| 30 |...]
/ \
Child: [10|20] [35|40]
After merge:
Parent: [...] ← 30 pulled down
|
Merged: [10|20|30|35|40]
Tree height reduction: if the parent was the root and had only one key, the merged node becomes the new root — the tree shrinks in height. This is the only way a B-tree gets shorter.
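As mentioned above, the rotation and merge steps can be sketched in Python over the same BTreeNode class. This shows only the proactive fix-up from Case 3 (making a child safe to enter), not the full delete, and the function name is illustrative.

```python
def fix_child(parent, i, t):
    """Ensure parent.children[i] has at least t keys before descending into it."""
    child = parent.children[i]
    if len(child.keys) >= t:
        return
    left = parent.children[i - 1] if i > 0 else None
    right = parent.children[i + 1] if i + 1 < len(parent.children) else None
    if left and len(left.keys) >= t:            # rotate through the parent (from left)
        child.keys.insert(0, parent.keys[i - 1])
        parent.keys[i - 1] = left.keys.pop()
        if not child.is_leaf:
            child.children.insert(0, left.children.pop())
    elif right and len(right.keys) >= t:        # rotate through the parent (from right)
        child.keys.append(parent.keys[i])
        parent.keys[i] = right.keys.pop(0)
        if not child.is_leaf:
            child.children.append(right.children.pop(0))
    else:                                       # both siblings minimal: merge
        j = i if right else i - 1               # merge children[j] with children[j+1]
        a, b = parent.children[j], parent.children[j + 1]
        a.keys.append(parent.keys.pop(j))       # separator pulled down
        a.keys.extend(b.keys)
        a.children.extend(b.children)
        parent.children.pop(j + 1)
```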
In a B+ tree, all data records live in the leaf nodes. Internal nodes store only keys and child pointers — a pure index.
| Feature | B-Tree | B+ Tree |
|---|---|---|
| Data location | Any node | Leaves only |
| Leaf linking | None | Doubly-linked list |
| Internal node keys | Separator + data | Copies / separators only |
| Fanout | Lower (data in internal nodes) | Higher (index-only internals) |
| Range queries | Requires tree traversal | Follow leaf pointers |
| Point query | May terminate at internal node | Always reaches a leaf |
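A minimal sketch of what a B+ tree leaf looks like under this scheme (class and field names are illustrative); the next_leaf link is what the range-query pseudocode below follows.

```python
class BPlusLeaf:
    """Illustrative B+ tree leaf: sorted keys, aligned values, sibling link."""
    def __init__(self):
        self.keys = []          # sorted search keys
        self.values = []        # records or row pointers, aligned with keys
        self.next_leaf = None   # link to the next leaf for range scans
```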
RANGE-QUERY(tree, low, high):
leaf = search for leaf containing low
results = []
while leaf is not null:
for each (key, val) in leaf:
if key > high:
return results
if key >= low:
results.append((key, val))
leaf = leaf.next_leaf
return results
| Operation | Cost |
|---|---|
| Find starting leaf | O(log_t n) |
| Scan matching leaves | O(k / b) |
| Total | O(log_t n + k/b) |
k = result size, b = keys per leaf
SELECT * FROM orders
WHERE order_date
BETWEEN '2025-01-01' AND '2025-03-31'
ORDER BY order_date;
With a B+ tree index on order_date, the database finds the first matching leaf and then scans leaves sequentially via the sibling links. This is why ORDER BY on an indexed column is essentially free when the query uses that index.
The B* tree is a B-tree variant in which every non-root node must be at least 2/3 full (vs 1/2 in a standard B-tree).
| Property | B-Tree | B* Tree |
|---|---|---|
| Minimum fill | 1/2 | 2/3 |
| Split strategy | Split 1 → 2 | Redistribute, then split 2 → 3 |
| Space utilisation | ~50–69% | ~66–81% |
| Average fanout | High | Even higher |
When a node overflows and its sibling is also full:
Before: [A B C D] [E F G H] ← both full
Redistribute across three nodes:
After: [A B C] [D E] [F G H]
(two separator keys move up to the parent, not shown)
HFS+ (macOS) historically used B* trees for its catalog file.
Height bound: for a B-tree of order m (minimum degree t = ⌈m/2⌉) with n ≥ 1 keys, the height satisfies h ≤ log_t((n + 1) / 2). A search touches at most h + 1 nodes, one per level, which is the disk-read count in the table below.
| Keys (n) | Order (m) | Min degree (t) | Max levels | Disk reads |
|---|---|---|---|---|
| 1,000 | 100 | 50 | 2 | 2 |
| 1,000,000 | 100 | 50 | 4 | 4 |
| 1,000,000,000 | 100 | 50 | 6 | 6 |
| 1,000,000 | 500 | 250 | 3 | 3 |
| 1,000,000,000 | 500 | 250 | 4 | 4 |
| 1,000,000,000 | 1000 | 500 | 4 | 4 |
Balanced BST with 10^9 keys: height ~30 → 30 disk reads.
B-tree of order 1000 with 10^9 keys: height ~3–4 → 3–4 disk reads.
The root node is almost always cached in memory. So a search in a billion-key B-tree often requires only 2–3 disk reads.
Fanout f = page_size / (key + ptr)
Example: 16 KB page, 8B keys, 8B ptrs
f = 16384 / 16 = 1024
Height for n = 10^9:
h = ceil(log_1024(10^9))
  ≈ ceil(2.99) = 3
| Structure | Search | Insert | Range (k) |
|---|---|---|---|
| Sorted array | O(log_B n) | O(n/B) | O(log_B n + k/B) |
| Binary tree | O(log n) | O(log n) | O(log n + k) |
| B-tree / B+ | O(log_B n) | O(log_B n) | O(log_B n + k/B) |
| Hash table | O(1) | O(1) | O(n/B) |
B-trees achieve optimal I/O for comparison-based search. Only hashing has better point-query I/O — but hashing cannot support range queries or ordering.
INCLUDE columns for covering indexes
| Database | B-Tree variant | Notes |
|---|---|---|
| SQLite | B+ tree | Both table and index structures |
| Oracle | B+ tree | Index-organised tables (IOT) option |
| SQL Server | B+ tree | Clustered and non-clustered indexes |
| MongoDB | B+ tree | WiredTiger engine, default since 3.2 |
Clustered vs heap: InnoDB's clustered index means PK range scans are sequential I/O. PostgreSQL's heap means any index scan may require random I/O to fetch heap tuples — mitigated by bitmap scans.
| File System | B-Tree usage |
|---|---|
| XFS | B+ trees for directory entries, free space, inodes, extents |
| ReiserFS | B+ tree (S+ tree) for all metadata and small file data |
| ZFS | DMU with indirect block trees (not classical B-trees) |
| ext4 | HTree (hash-indexed B-tree variant) for directories |
Common theme: file systems need sorted, balanced indexes for directories and metadata — exactly what B-trees provide. The high fanout keeps directory lookups under a handful of I/Os even for millions of files.
LSM-trees buffer writes in an in-memory memtable, then flush to immutable sorted runs (SSTables) on disk. Periodic compaction merges runs.
| Property | B-Tree | LSM-Tree |
|---|---|---|
| Write pattern | Random (update in place) | Sequential (append + compact) |
| Write amplification | Lower for updates | Higher (compaction rewrites) |
| Write throughput | Moderate | High (sequential I/O) |
| Read latency | Predictable O(log_B n) | May check multiple levels |
| Space amplification | ~50–69% utilisation | Multiple copies during compaction |
| Range queries | Excellent (leaf scan) | Good (after compaction) |
B-trees suit read-heavy OLTP, predictable-latency requirements, range scans, and general-purpose workloads. LSM-trees suit write-heavy workloads, time-series data, logging, and append-mostly data. Well-known LSM-based systems include RocksDB, LevelDB, Cassandra, and HBase.
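A toy sketch of the write path described above; the names and flush threshold are made up, and real engines add a write-ahead log, bloom filters, and background compaction.

```python
class ToyLSM:
    """Illustrative LSM store: in-memory memtable plus immutable sorted runs."""
    def __init__(self, memtable_limit=4):
        self.memtable = {}            # mutable, in memory
        self.runs = []                # immutable sorted runs, newest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.runs.insert(0, sorted(self.memtable.items()))   # "flush" to disk
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in self.runs:         # newest run wins; reads may touch many runs
            for k, v in run:
                if k == key:
                    return v
        return None
```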
B-trees write an entire page for each modified key. If a page holds 500 keys but only one changed, you still write the full page.
Fractal trees (also called Bε-trees) address this: each internal node has a message buffer. Writes are inserted into the buffer at the root and lazily pushed down toward the leaves.
[30 | 60]
buffer: [ins(45), del(12)]
/ | \
[...] [...] [...]
Messages trickle down over time
| Property | B-Tree | Fractal Tree |
|---|---|---|
| Point query | O(log_B n) | O(log_B n) |
| Insert | O(log_B n) | O(log_B n / B^ε) |
| Write amplification | Higher | Significantly lower |
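A toy sketch of the message-buffer idea above (names, buffer size, and routing are simplified and hypothetical): writes land in the root's buffer and are pushed one level down only when the buffer fills; children's buffers would be flushed the same way.

```python
class BufferedNode:
    """Illustrative node for a buffered (fractal / Bε-style) tree."""
    def __init__(self, is_leaf=True):
        self.keys, self.children = [], []
        self.buffer = []                       # pending ("ins"/"del", key) messages
        self.is_leaf = is_leaf

def buffered_insert(root, key, buffer_limit=4):
    root.buffer.append(("ins", key))           # O(1) apparent cost at the root
    if not root.is_leaf and len(root.buffer) >= buffer_limit:
        flush_buffer(root)

def flush_buffer(node):
    """Push each buffered message to the child whose key range covers it."""
    for op, key in node.buffer:
        i = sum(1 for k in node.keys if key > k)
        child = node.children[i]
        if child.is_leaf:
            if op == "ins":
                child.keys.append(key)
                child.keys.sort()
            elif key in child.keys:
                child.keys.remove(key)
        else:
            child.buffer.append((op, key))     # message trickles down lazily
    node.buffer = []
```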
LSM-trees: batch writes, sequential flush, periodic compaction. Used by RocksDB, Cassandra, HBase.
Bw-trees: append delta records to pages; consolidate lazily. Used in SQL Server Hekaton and Azure Cosmos DB.
LA-trees: lazy-adaptive trees combining B-tree and LSM ideas for workload-responsive behaviour.
TokuDB (now archived) used fractal trees. The ideas live on in PerconaFT and influenced modern hybrid designs.
Instead of inserting keys one by one, bulk loading sorts the input and builds the tree bottom-up: fill leaves sequentially to a target fill factor, then build each index level from the separators of the level below.
CREATE INDEX in PostgreSQL and InnoDB uses optimised bulk loading.
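A sketch of bottom-up bulk loading for a B+ tree, assuming the keys are already sorted; the node shape and names are illustrative, and real loaders also leave slack in each node for future inserts.

```python
def bulk_load(sorted_keys, fanout):
    """Build leaves left to right, then build each index level from the
    smallest key of every child; returns the root (assumes non-empty input)."""
    # Leaf level: consecutive runs of sorted keys, each paired with its min key.
    level = [(chunk[0], {"leaf": chunk})
             for chunk in (sorted_keys[i:i + fanout]
                           for i in range(0, len(sorted_keys), fanout))]
    while len(level) > 1:
        parents = []
        for i in range(0, len(level), fanout):
            group = level[i:i + fanout]
            node = {"separators": [min_key for min_key, _ in group[1:]],
                    "children": [child for _, child in group]}
            parents.append((group[0][0], node))
        level = parents
    return level[0][1]
```

Built this way, every page is written exactly once, instead of being re-read and rewritten repeatedly as single-key inserts would do.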
| Approach | Description |
|---|---|
| Lock coupling | Lock parent, lock child, release parent (see the sketch after this table) |
| B-link trees | Right-link pointers; readers move right without locks during splits |
| Optimistic | Read without locks; validate before write |
| OLFIT | Version counters on nodes detect concurrent modifications |
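A sketch of the lock-coupling row above, assuming each node from the earlier BTreeNode sketch also carries a latch attribute; that attribute and the helper below are hypothetical additions for illustration only.

```python
import threading

def add_latches(node):
    """Attach a latch to every node (hypothetical helper for this sketch)."""
    node.latch = threading.Lock()
    for child in node.children:
        add_latches(child)

def search_with_coupling(node, key):
    """Hold the parent latch only until the child latch is acquired."""
    node.latch.acquire()
    while not node.is_leaf:
        i = 0
        while i < len(node.keys) and key > node.keys[i]:
            i += 1
        child = node.children[i]
        child.latch.acquire()      # latch the child...
        node.latch.release()       # ...then release the parent ("crabbing")
        node = child
    try:
        return key in node.keys
    finally:
        node.latch.release()
```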
| Source | Description |
|---|---|
| Cormen et al. | CLRS — Chapter 18: B-Trees |
| Graefe | "Modern B-Tree Techniques" — 2011 survey |
| Kleppmann | DDIA — Chapter 3: Storage and Retrieval |
| Bayer & McCreight | "Organization and Maintenance of Large Ordered Indexes" (1970) |
| Lehman & Yao | "Efficient Locking for Concurrent Operations on B-Trees" (1981) |
| CMU 15-445 | Andy Pavlo's Database Systems — storage & indexing lectures |