Main memory is fast but limited. Databases must store far more data than fits in RAM, so most data lives on disk (SSD or HDD).
The dominant cost in disk-based data structures is the number of disk page reads and writes, not CPU comparisons. An ideal index structure therefore minimises the number of disk I/Os per operation; this is the fundamental insight behind B-trees.
A B-tree of order m (also called minimum degree t where m = 2t) satisfies these properties (Knuth, 1973):
| Property | Description |
|---|---|
| Root | Has between 1 and m − 1 keys; it is exempt from the minimum-children rule (it may be a leaf, otherwise it has at least 2 children) |
| Internal nodes | Each non-root internal node has between ⌈m/2⌉ and m children |
| Keys per node | Each node with k children stores exactly k − 1 keys |
| Sorted keys | Keys within each node are in ascending order |
| Balanced | All leaves appear on the same level |
| Subtree ordering | For key Ki, all keys in the left subtree < Ki and all keys in the right subtree > Ki |
Key insight: A B-tree of order 1000 holding 1 billion keys has a height of at most 4, and typically 3. Three or four disk reads reach any key among a billion.
Each node occupies one disk page and contains interleaved keys and child pointers:
+----+----+----+----+----+----+----+----+----+
| P0 | K1 | P1 | K2 | P2 | K3 | P3 |... | Pn |
+----+----+----+----+----+----+----+----+----+
Ki = key i (sorted)
Pi = pointer to child i (disk page address)
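To make the layout concrete, here is a minimal in-memory sketch of a node in Python; the class name and fields are illustrative, and on disk each instance would be serialised into one page. Later sketches in this section reuse this class.

```python
class BTreeNode:
    """Illustrative in-memory node; one instance corresponds to one disk page."""
    def __init__(self, t, is_leaf=True):
        self.t = t                # minimum degree (m = 2t)
        self.is_leaf = is_leaf
        self.keys = []            # up to 2t - 1 keys, kept sorted
        self.children = []        # up to 2t child nodes ("page pointers")
```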
| Parameter | Typical Value |
|---|---|
| Page size | 8 KB – 16 KB |
| Key size | 8 – 64 bytes |
| Pointer size | 6 – 8 bytes |
| Keys per node | 100 – 500+ |
| Fanout | 101 – 501+ |
High fanout is the key advantage. A binary tree with 1 million keys has height ~20. A B-tree with fanout 500 has height ~3.
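A quick back-of-envelope check of those heights (illustrative numbers only):

```python
import math

n = 1_000_000
print(math.ceil(math.log(n, 2)))    # binary tree: ~20 levels
print(math.ceil(math.log(n, 500)))  # B-tree with fanout 500: ~3 levels
```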
BTREE-SEARCH(node, key):
    i = 0
    while i < node.num_keys and key > node.keys[i]:
        i += 1
    if i < node.num_keys and key == node.keys[i]:
        return (node, i)                 // found
    if node.is_leaf:
        return NOT_FOUND
    DISK-READ(node.children[i])          // 1 I/O per level descended
    return BTREE-SEARCH(node.children[i], key)
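The same search as a runnable Python sketch against the BTreeNode class above; page reads are only counted, not performed, since everything here is in memory.

```python
def btree_search(node, key, reads=1):
    """Return ((node, index), pages_read) if found, else (None, pages_read)."""
    i = 0
    while i < len(node.keys) and key > node.keys[i]:
        i += 1
    if i < len(node.keys) and key == node.keys[i]:
        return (node, i), reads
    if node.is_leaf:
        return None, reads
    # In a disk-based tree this is where DISK-READ would load the child page.
    return btree_search(node.children[i], key, reads + 1)
```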
Instead of splitting bottom-up after overflow, split any full node encountered on the way down during the search:
Proactive splitting is the standard approach. It avoids the complexity of bottom-up split propagation and simplifies concurrent access.
Insert 25 into a full leaf:
Before: [10 | 20 | 30 | 40] ← full
Step 1: Split at median (30)
          [30]   ← promoted
         /    \
   [10|20]    [40]
Step 2: Insert 25
          [30]
         /    \
[10|20|25]    [40]   ← done
Root splits: When the root is full and must split, a new root is created with the promoted median. This is the only way a B-tree grows taller.
SPLIT-CHILD(parent, i):
    full = parent.children[i]            // full node: m - 1 = 2t - 1 keys (0-indexed)
    median = full.keys[t-1]
    new = allocate_page()
    // Right half (the t - 1 keys after the median) → new node
    new.keys = full.keys[t .. m-2]       // inclusive indices
    new.children = full.children[t .. m-1]
    // Left half stays in place
    full.num_keys = t - 1
    // Promote the median to the parent
    insert median at parent.keys[i]
    parent.children[i+1] = new
    DISK-WRITE(full)                     // 3 writes per split
    DISK-WRITE(new)
    DISK-WRITE(parent)
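A Python sketch of the whole proactive insert, pairing SPLIT-CHILD with a top-down descent over the BTreeNode class from earlier. It is in-memory only, so the DISK-WRITE calls are omitted, and the function names are illustrative.

```python
import bisect

def split_child(parent, i, t):
    """Split the full child parent.children[i]; promote its median key."""
    full = parent.children[i]                  # has 2t - 1 keys
    new = BTreeNode(t, is_leaf=full.is_leaf)
    median = full.keys[t - 1]
    new.keys = full.keys[t:]                   # right half of the keys
    full.keys = full.keys[:t - 1]              # left half stays behind
    if not full.is_leaf:
        new.children = full.children[t:]
        full.children = full.children[:t]
    parent.keys.insert(i, median)              # median moves up
    parent.children.insert(i + 1, new)

def btree_insert(root, key, t):
    """Insert key, splitting any full node met on the way down; returns the root."""
    if len(root.keys) == 2 * t - 1:            # full root: tree grows taller
        new_root = BTreeNode(t, is_leaf=False)
        new_root.children.append(root)
        split_child(new_root, 0, t)
        root = new_root
    node = root
    while not node.is_leaf:
        i = 0
        while i < len(node.keys) and key > node.keys[i]:
            i += 1
        if len(node.children[i].keys) == 2 * t - 1:   # full child: split first
            split_child(node, i, t)
            if key > node.keys[i]:                    # descend right of new median
                i += 1
        node = node.children[i]
    bisect.insort(node.keys, key)                     # leaf is guaranteed non-full
    return root
```

Because every child is made non-full before the descent enters it, the final leaf always has room and no backtracking is needed.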
| Operation | Count |
|---|---|
| Disk reads (search path) | O(log_t n) |
| Disk writes (per split) | 3 |
| Splits per insertion (worst) | O(log_t n) |
| Amortised splits | O(1) |
In practice, splits are rare. Most insertions just add a key to a non-full leaf — one disk write.
| Case | Situation | Action |
|---|---|---|
| 1 | Key found in a leaf node | Remove the key directly |
| 2 | Key found in an internal node | Replace with predecessor (or successor) from a leaf, then delete from leaf |
| 3 | Key not in current node | Ensure child has ≥ t keys before recursing (fix underflow proactively) |
When descending to a child that has only t − 1 keys (minimum), we must ensure it has at least t keys before entering it:
If an adjacent sibling has ≥ t keys, rotate a key through the parent into the deficient child.
If both adjacent siblings have exactly t − 1 keys, merge the child with a sibling and pull the separator key down from the parent.
Like insertion, deletion uses proactive fixing: we ensure each node on the path down has enough keys so we never need to backtrack (a code sketch follows the merge example below).
Before (right sibling has spare keys):
Parent: [...| 30 |...]
/ \
Child: [10|20] [35|40|50|60]
After rotation:
Parent: [...| 35 |...]
/ \
Child: [10|20|30] [40|50|60]
Key 30 demoted, key 35 promoted.
Before (both have t-1 keys):
Parent: [...| 30 |...]
/ \
Child: [10|20] [35|40]
After merge:
Parent: [...] ← 30 pulled down
|
Merged: [10|20|30|35|40]
Tree height reduction: if the parent was the root and had only one key, the merged node becomes the new root — the tree shrinks in height. This is the only way a B-tree gets shorter.
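As mentioned above, the rotation and merge steps can be sketched in Python over the same BTreeNode class. This shows only the proactive fix-up from Case 3 (making a child safe to enter), not the full delete, and the function name is illustrative.

```python
def fix_child(parent, i, t):
    """Ensure parent.children[i] has at least t keys before descending into it."""
    child = parent.children[i]
    if len(child.keys) >= t:
        return
    left = parent.children[i - 1] if i > 0 else None
    right = parent.children[i + 1] if i + 1 < len(parent.children) else None
    if left and len(left.keys) >= t:            # rotate through the parent (from left)
        child.keys.insert(0, parent.keys[i - 1])
        parent.keys[i - 1] = left.keys.pop()
        if not child.is_leaf:
            child.children.insert(0, left.children.pop())
    elif right and len(right.keys) >= t:        # rotate through the parent (from right)
        child.keys.append(parent.keys[i])
        parent.keys[i] = right.keys.pop(0)
        if not child.is_leaf:
            child.children.append(right.children.pop(0))
    else:                                       # both siblings minimal: merge
        j = i if right else i - 1               # merge children[j] with children[j+1]
        a, b = parent.children[j], parent.children[j + 1]
        a.keys.append(parent.keys.pop(j))       # separator pulled down
        a.keys.extend(b.keys)
        a.children.extend(b.children)
        parent.children.pop(j + 1)
```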
In a B+ tree, all data records live in the leaf nodes. Internal nodes store only keys and child pointers — a pure index.
| Feature | B-Tree | B+ Tree |
|---|---|---|
| Data location | Any node | Leaves only |
| Leaf linking | None | Doubly-linked list |
| Internal node keys | Separator + data | Copies / separators only |
| Fanout | Lower (data in internal nodes) | Higher (index-only internals) |
| Range queries | Requires tree traversal | Follow leaf pointers |
| Point query | May terminate at internal node | Always reaches a leaf |
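A minimal sketch of what a B+ tree leaf looks like under this scheme (class and field names are illustrative); the next_leaf link is what the range-query pseudocode below follows.

```python
class BPlusLeaf:
    """Illustrative B+ tree leaf: sorted keys, aligned values, sibling link."""
    def __init__(self):
        self.keys = []          # sorted search keys
        self.values = []        # records or row pointers, aligned with keys
        self.next_leaf = None   # link to the next leaf for range scans
```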
RANGE-QUERY(tree, low, high):
leaf = search for leaf containing low
results = []
while leaf is not null:
for each (key, val) in leaf:
if key > high:
return results
if key >= low:
results.append((key, val))
leaf = leaf.next_leaf
return results
| Operation | Cost |
|---|---|
| Find starting leaf | O(log_t n) |
| Scan matching leaves | O(k / b) |
| Total | O(log_t n + k/b) |
k = result size, b = keys per leaf
SELECT * FROM orders
WHERE order_date
BETWEEN '2025-01-01' AND '2025-03-31'
ORDER BY order_date;
With a B+ tree index on order_date, the database finds the first matching leaf and then scans leaves sequentially via the sibling links. This is why ORDER BY on an indexed column is essentially free when the query uses that index.
The B* tree is a B-tree variant in which every non-root node must be at least 2/3 full (vs 1/2 in a standard B-tree).
| Property | B-Tree | B* Tree |
|---|---|---|
| Minimum fill | 1/2 | 2/3 |
| Split strategy | Split 1 → 2 | Redistribute, then split 2 → 3 |
| Space utilisation | ~50–69% | ~66–81% |
| Average fanout | High | Even higher |
When a node overflows and its sibling is also full:
Before: [A B C D] [E F G H] ← both full
Redistribute across three nodes:
After: [A B C] [D E] [F G H]
(two separator keys move up to the parent, not shown)
HFS+ (macOS) historically used B* trees for its catalog file.
Height bound: for a B-tree of order m (minimum degree t = ⌈m/2⌉) with n ≥ 1 keys, the height satisfies h ≤ log_t((n + 1) / 2). A search touches at most h + 1 nodes, one per level, which is the disk-read count in the table below.
| Keys (n) | Order (m) | Min degree (t) | Max levels | Disk reads |
|---|---|---|---|---|
| 1,000 | 100 | 50 | 2 | 2 |
| 1,000,000 | 100 | 50 | 4 | 4 |
| 1,000,000,000 | 100 | 50 | 6 | 6 |
| 1,000,000 | 500 | 250 | 3 | 3 |
| 1,000,000,000 | 500 | 250 | 4 | 4 |
| 1,000,000,000 | 1000 | 500 | 4 | 4 |
Balanced BST with 10^9 keys: height ~30 → 30 disk reads.
B-tree of order 1000 with 10^9 keys: height ~3–4 → 3–4 disk reads.
The root node is almost always cached in memory. So a search in a billion-key B-tree often requires only 2–3 disk reads.
Fanout f = page_size / (key + ptr)
Example: 16 KB page, 8B keys, 8B ptrs
f = 16384 / 16 = 1024
Height for n = 10^9:
h = ceil(log_1024(10^9))
  ≈ ceil(2.99) = 3
| Structure | Search | Insert | Range (k) |
|---|---|---|---|
| Sorted array | O(log_B n) | O(n/B) | O(log_B n + k/B) |
| Binary tree | O(log n) | O(log n) | O(log n + k) |
| B-tree / B+ | O(log_B n) | O(log_B n) | O(log_B n + k/B) |
| Hash table | O(1) | O(1) | O(n/B) |
B-trees achieve optimal I/O for comparison-based search. Only hashing has better point-query I/O — but hashing cannot support range queries or ordering.
INCLUDE columns for covering indexes
| Database | B-Tree variant | Notes |
|---|---|---|
| SQLite | B+ tree | Both table and index structures |
| Oracle | B+ tree | Index-organised tables (IOT) option |
| SQL Server | B+ tree | Clustered and non-clustered indexes |
| MongoDB | B+ tree | WiredTiger engine, default since 3.2 |
Clustered vs heap: InnoDB's clustered index means PK range scans are sequential I/O. PostgreSQL's heap means any index scan may require random I/O to fetch heap tuples — mitigated by bitmap scans.
| File System | B-Tree usage |
|---|---|
| XFS | B+ trees for directory entries, free space, inodes, extents |
| ReiserFS | B+ tree (S+ tree) for all metadata and small file data |
| ZFS | DMU with indirect block trees (not classical B-trees) |
| ext4 | HTree (hash-indexed B-tree variant) for directories |
Common theme: file systems need sorted, balanced indexes for directories and metadata — exactly what B-trees provide. The high fanout keeps directory lookups under a handful of I/Os even for millions of files.
LSM-trees buffer writes in an in-memory memtable, then flush to immutable sorted runs (SSTables) on disk. Periodic compaction merges runs.
| Property | B-Tree | LSM-Tree |
|---|---|---|
| Write pattern | Random (update in place) | Sequential (append + compact) |
| Write amplification | Lower for updates | Higher (compaction rewrites) |
| Write throughput | Moderate | High (sequential I/O) |
| Read latency | Predictable O(log_B n) | May check multiple levels |
| Space amplification | ~50–69% utilisation | Multiple copies during compaction |
| Range queries | Excellent (leaf scan) | Good (after compaction) |
B-trees suit read-heavy OLTP, predictable-latency requirements, range scans, and general-purpose workloads. LSM-trees suit write-heavy workloads, time-series data, logging, and append-mostly data. Well-known LSM-based systems include RocksDB, LevelDB, Cassandra, and HBase.
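A toy sketch of the write path described above; the names and flush threshold are made up, and real engines add a write-ahead log, bloom filters, and background compaction.

```python
class ToyLSM:
    """Illustrative LSM store: in-memory memtable plus immutable sorted runs."""
    def __init__(self, memtable_limit=4):
        self.memtable = {}            # mutable, in memory
        self.runs = []                # immutable sorted runs, newest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.runs.insert(0, sorted(self.memtable.items()))   # "flush" to disk
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in self.runs:         # newest run wins; reads may touch many runs
            for k, v in run:
                if k == key:
                    return v
        return None
```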
B-trees write an entire page for each modified key. If a page holds 500 keys but only one changed, you still write the full page.
Fractal trees (also called Bε-trees) address this: each internal node has a message buffer. Writes are inserted into the buffer at the root and lazily pushed down toward the leaves.
[30 | 60]
buffer: [ins(45), del(12)]
/ | \
[...] [...] [...]
Messages trickle down over time
| Property | B-Tree | Fractal Tree |
|---|---|---|
| Point query | O(log_B n) | O(log_B n) |
| Insert | O(log_B n) | O(log_B n / B^ε) |
| Write amplification | Higher | Significantly lower |
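A toy sketch of the message-buffer idea above (names, buffer size, and routing are simplified and hypothetical): writes land in the root's buffer and are pushed one level down only when the buffer fills; children's buffers would be flushed the same way.

```python
class BufferedNode:
    """Illustrative node for a buffered (fractal / Bε-style) tree."""
    def __init__(self, is_leaf=True):
        self.keys, self.children = [], []
        self.buffer = []                       # pending ("ins"/"del", key) messages
        self.is_leaf = is_leaf

def buffered_insert(root, key, buffer_limit=4):
    root.buffer.append(("ins", key))           # O(1) apparent cost at the root
    if not root.is_leaf and len(root.buffer) >= buffer_limit:
        flush_buffer(root)

def flush_buffer(node):
    """Push each buffered message to the child whose key range covers it."""
    for op, key in node.buffer:
        i = sum(1 for k in node.keys if key > k)
        child = node.children[i]
        if child.is_leaf:
            if op == "ins":
                child.keys.append(key)
                child.keys.sort()
            elif key in child.keys:
                child.keys.remove(key)
        else:
            child.buffer.append((op, key))     # message trickles down lazily
    node.buffer = []
```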
LSM-trees: batch writes, sequential flush, periodic compaction. Used by RocksDB, Cassandra, HBase.
Bw-trees: append delta records to pages; consolidate lazily. Used in SQL Server Hekaton and Azure Cosmos DB.
LA-trees: lazy-adaptive trees combining B-tree and LSM ideas for workload-responsive behaviour.
TokuDB (now archived) used fractal trees. The ideas live on in PerconaFT and influenced modern hybrid designs.
Instead of inserting keys one by one, bulk loading sorts the input and builds the tree bottom-up: fill leaves sequentially to a target fill factor, then build each index level from the separators of the level below.
CREATE INDEX in PostgreSQL and InnoDB uses optimised bulk loading.
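A sketch of bottom-up bulk loading for a B+ tree, assuming the keys are already sorted; the node shape and names are illustrative, and real loaders also leave slack in each node for future inserts.

```python
def bulk_load(sorted_keys, fanout):
    """Build leaves left to right, then build each index level from the
    smallest key of every child; returns the root (assumes non-empty input)."""
    # Leaf level: consecutive runs of sorted keys, each paired with its min key.
    level = [(chunk[0], {"leaf": chunk})
             for chunk in (sorted_keys[i:i + fanout]
                           for i in range(0, len(sorted_keys), fanout))]
    while len(level) > 1:
        parents = []
        for i in range(0, len(level), fanout):
            group = level[i:i + fanout]
            node = {"separators": [min_key for min_key, _ in group[1:]],
                    "children": [child for _, child in group]}
            parents.append((group[0][0], node))
        level = parents
    return level[0][1]
```

Built this way, every page is written exactly once, instead of being re-read and rewritten repeatedly as single-key inserts would do.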
| Approach | Description |
|---|---|
| Lock coupling | Lock parent, lock child, release parent (see the sketch after this table) |
| B-link trees | Right-link pointers; readers move right without locks during splits |
| Optimistic | Read without locks; validate before write |
| OLFIT | Version counters on nodes detect concurrent modifications |
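A sketch of the lock-coupling row above, assuming each node from the earlier BTreeNode sketch also carries a latch attribute; that attribute and the helper below are hypothetical additions for illustration only.

```python
import threading

def add_latches(node):
    """Attach a latch to every node (hypothetical helper for this sketch)."""
    node.latch = threading.Lock()
    for child in node.children:
        add_latches(child)

def search_with_coupling(node, key):
    """Hold the parent latch only until the child latch is acquired."""
    node.latch.acquire()
    while not node.is_leaf:
        i = 0
        while i < len(node.keys) and key > node.keys[i]:
            i += 1
        child = node.children[i]
        child.latch.acquire()      # latch the child...
        node.latch.release()       # ...then release the parent ("crabbing")
        node = child
    try:
        return key in node.keys
    finally:
        node.latch.release()
```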
| Source | Description |
|---|---|
| Cormen et al. | CLRS — Chapter 18: B-Trees |
| Graefe | "Modern B-Tree Techniques" — 2011 survey |
| Kleppmann | DDIA — Chapter 3: Storage and Retrieval |
| Bayer & McCreight | "Organization and Maintenance of Large Ordered Indexes" (1970) |
| Lehman & Yao | "Efficient Locking for Concurrent Operations on B-Trees" (1981) |
| CMU 15-445 | Andy Pavlo's Database Systems — storage & indexing lectures |