COMPUTER SCIENCE FUNDAMENTALS SERIES

B-Trees

B-tree operations · B+ trees · Disk I/O ·
Database indexes · Node splitting · LSM comparison
Mid-level software engineer track  ·  20 slides
02

Motivation — Disk-Based Storage

The memory-disk gap

Main memory is fast but limited. Databases must store far more data than fits in RAM, so most data lives on disk (SSD or HDD).

  • Random disk read: ~100 μs (SSD) or ~10 ms (HDD)
  • Random RAM access: ~100 ns — three to five orders of magnitude faster
  • Sequential disk reads are much cheaper than random — amortise seek cost over large blocks
  • OS reads data in fixed-size pages (typically 4 KB); databases use 8–16 KB pages

Minimising I/O

The dominant cost in disk-based data structures is the number of disk page reads/writes. An ideal index structure:

  • Keeps the tree short so search touches few pages
  • Packs many keys per node so each page read yields maximum information
  • Maintains balance to guarantee worst-case performance
  • Exploits sequential access patterns where possible

Design goal: minimise the number of disk I/Os per operation — not CPU comparisons. This is the fundamental insight behind B-trees.
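
To make the gap concrete, a back-of-envelope sketch in Python (the latency constants are the rough figures quoted above, not measurements):

# Rough time for 1,000,000 random point reads (constants from the bullets above).
RAM_NS, SSD_US, HDD_MS = 100, 100, 10
N = 1_000_000

print(f"RAM: {N * RAM_NS / 1e9:6.1f} s")         # ~0.1 s
print(f"SSD: {N * SSD_US / 1e6:6.1f} s")         # ~100 s
print(f"HDD: {N * HDD_MS / 1e3 / 3600:6.1f} h")  # ~2.8 hours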

03

B-Tree Definition

A B-tree of order m (at most m children per node; equivalently, minimum degree t where m = 2t) satisfies these properties (Knuth, 1973):

Property          | Description
------------------|------------
Root              | Has between 1 and m − 1 keys (and may be a leaf with no children)
Internal nodes    | Each non-root internal node has between ⌈m/2⌉ and m children
Keys per node     | Each node with k children stores exactly k − 1 keys
Sorted keys       | Keys within each node are in ascending order
Balanced          | All leaves appear on the same level
Subtree ordering  | For separator key Ki, every key in the subtree to its left is < Ki and every key in the subtree to its right is > Ki

Why these constraints?

  • Minimum fill — nodes at least half full → good space utilisation
  • Balance — O(log n) height → predictable search cost
  • High fanout — shallow tree → fewer disk reads per search

Key insight: A B-tree of order 1000 with 1 billion keys has height ≤ 3. Three disk reads to find any key among a billion.
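
A quick sanity check of that claim (a sketch assuming every node is completely full; a full tree of order m with h levels stores m^h − 1 keys):

# Key capacity of an order-m B-tree with h levels, all nodes 100% full:
# sum over levels of (m - 1) * m^i  =  m^h - 1
def capacity(m: int, levels: int) -> int:
    return m ** levels - 1

print(capacity(1000, 3))   # 999_999_999 -- ~1 billion keys, 3 page reads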

04

B-Tree Node Structure

Anatomy of a B-tree node

Each node occupies one disk page and contains interleaved keys and child pointers:

+----+----+----+----+----+----+----+----+----+
| P0 | K1 | P1 | K2 | P2 | K3 | P3 |... | Pn |
+----+----+----+----+----+----+----+----+----+

Ki = key i (sorted)
Pi = pointer to child i (disk page address)
  • Keys stored in sorted order within the node
  • Pointers (child page addresses) interleave with keys
  • Leaf nodes have no child pointers (or null pointers)
  • Each node may also store associated data values (B-tree) or just keys (B+ tree)

Typical sizing

Parameter     | Typical value
--------------|--------------
Page size     | 8–16 KB
Key size      | 8–64 bytes
Pointer size  | 6–8 bytes
Keys per node | 100–500+
Fanout        | 101–501+

High fanout is the key advantage. A binary tree with 1 million keys has height ~20. A B-tree with fanout 500 has height ~3.
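
As a sketch in Python (field names are illustrative, not from any particular engine), a node is just sorted keys plus child page references, where keys[i] separates children[i] and children[i+1]:

from dataclasses import dataclass, field

@dataclass
class BTreeNode:
    keys: list = field(default_factory=list)      # K1..Kn, sorted
    children: list = field(default_factory=list)  # P0..Pn, child page refs
    is_leaf: bool = True

# Fanout estimate matching the sizing table: 16 KB page, assuming
# 16-byte keys and 8-byte child pointers per slot.
PAGE, KEY, PTR = 16 * 1024, 16, 8
print(PAGE // (KEY + PTR))   # ~682 children per node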

05

B-Tree Search

Algorithm

BTREE-SEARCH(node, key):
    i = 0
    while i < node.num_keys
          and key > node.keys[i]:
        i += 1
    if i < node.num_keys
       and key == node.keys[i]:
        return (node, i)        // found
    if node.is_leaf:
        return NOT_FOUND
    DISK-READ(node.children[i]) // 1 I/O
    return BTREE-SEARCH(
        node.children[i], key)
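
The same algorithm as runnable Python (a sketch: dicts stand in for disk pages, and the hop into a child is where a real engine would issue DISK-READ):

from bisect import bisect_left

def btree_search(node, key):
    i = bisect_left(node["keys"], key)           # binary search within node
    if i < len(node["keys"]) and node["keys"][i] == key:
        return node, i                           # found at (node, i)
    if node["is_leaf"]:
        return None                              # NOT_FOUND
    return btree_search(node["children"][i], key)  # 1 simulated disk read

root = {"keys": [30, 60], "is_leaf": False, "children": [
    {"keys": [10, 20], "is_leaf": True, "children": []},
    {"keys": [35, 42, 55], "is_leaf": True, "children": []},
    {"keys": [70, 80, 90], "is_leaf": True, "children": []},
]}
print(btree_search(root, 42))   # found in the middle leaf, index 1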

Cost analysis

  • Each level requires one disk read
  • Within a node: binary search over the keys — O(log m) comparisons (CPU, cheap; the pseudocode shows a linear scan for simplicity)
  • Total disk I/Os: O(log_t n)
  • Total CPU comparisons: O(log n)

Example: searching for key 42

            [30 | 60]              ← read 1
           /    |    \
   [10|20] [35|42|55] [70|80|90]   ← read 2

Root: 42 > 30 and 42 < 60 → follow the middle pointer → find 42 at position 1.
Total: 2 disk reads.
06

B-Tree Insertion

Algorithm overview

  1. Search for the correct leaf node
  2. If the leaf has room, insert the key in sorted position — done
  3. If the leaf is full (m − 1 keys), split the node

Proactive (preemptive) splitting

Instead of splitting bottom-up after overflow, split any full node encountered on the way down during the search:

  • At most one split per level
  • The parent always has room for the promoted key
  • Single-pass insertion — no need to backtrack up the tree

Proactive splitting is the standard textbook approach (CLRS). It avoids the complexity of bottom-up split propagation and simplifies concurrent access.

Insertion example (order 5)

Insert 25 into a full leaf:

Before:  [10 | 20 | 30 | 40]  ← full

Step 1: Split at median (30)

          [30]         ← promoted
         /    \
   [10|20]    [40]

Step 2: Insert 25

          [30]
         /    \
  [10|20|25]  [40]     ← done

Root splits: When the root is full and must split, a new root is created with the promoted median. This is the only way a B-tree grows taller.
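
Putting the pieces together, a runnable sketch of single-pass insertion with proactive splitting (CLRS-style minimum degree t; plain dicts stand in for disk pages — an illustration, not a storage engine):

from bisect import bisect_left, insort

T = 3                      # minimum degree t; a full node has 2t - 1 keys

def new_node(leaf=True):
    return {"keys": [], "children": [], "is_leaf": leaf}

def split_child(parent, i):
    """Split parent's full child i around its median key."""
    full = parent["children"][i]
    right = new_node(full["is_leaf"])
    median = full["keys"][T - 1]
    right["keys"], full["keys"] = full["keys"][T:], full["keys"][:T - 1]
    if not full["is_leaf"]:
        right["children"] = full["children"][T:]
        full["children"] = full["children"][:T]
    parent["keys"].insert(i, median)            # promote the median
    parent["children"].insert(i + 1, right)

def insert(root, key):
    if len(root["keys"]) == 2 * T - 1:          # full root: tree grows taller
        root = {"keys": [], "children": [root], "is_leaf": False}
        split_child(root, 0)
    node = root
    while not node["is_leaf"]:
        i = bisect_left(node["keys"], key)
        if len(node["children"][i]["keys"]) == 2 * T - 1:
            split_child(node, i)                # proactive split on the way down
            if key > node["keys"][i]:
                i += 1
        node = node["children"][i]
    insort(node["keys"], key)                   # leaf has room by the invariant
    return root

root = new_node()
for k in [10, 20, 30, 40, 25, 5, 15, 35]:
    root = insert(root, k)
print(root["keys"])   # [25] -- root separator after the first split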

07

Node Splitting in Detail

Split algorithm

SPLIT-CHILD(parent, i):
    full = parent.children[i]   // full: m − 1 = 2t − 1 keys
    median = full.keys[t-1]     // 0-indexed median
    new = allocate_page()

    // Right half → new node (t − 1 keys, t children)
    new.keys = full.keys[t .. 2t-2]
    new.children = full.children[t .. 2t-1]

    // Left half stays (t − 1 keys)
    full.num_keys = t - 1

    // Promote median to parent
    insert median at parent.keys[i]
    parent.children[i+1] = new

    DISK-WRITE(full)    // 3 writes
    DISK-WRITE(new)     // per split
    DISK-WRITE(parent)

Split cost

Operation                    | Count
-----------------------------|------
Disk reads (search path)     | O(log_t n)
Disk writes (per split)      | 3
Splits per insertion (worst) | O(log_t n)
Amortised splits             | O(1)

In practice, splits are rare. Most insertions just add a key to a non-full leaf — one disk write.

08

B-Tree Deletion

Three cases

Case | Situation                     | Action
-----|-------------------------------|-------
1    | Key found in a leaf node      | Remove the key directly
2    | Key found in an internal node | Replace with its predecessor (or successor) from a leaf, then delete that key from the leaf
3    | Key not in the current node   | Ensure the child has ≥ t keys before recursing (fix underflow proactively)

Fixing underflow before recursion

When descending to a child that has only t − 1 keys (minimum), we must ensure it has at least t keys before entering it:

Borrow from sibling

If an adjacent sibling has ≥ t keys, rotate a key through the parent into the deficient child.

Merge

If both adjacent siblings have exactly t − 1 keys, merge the child with a sibling and pull the separator key down from the parent.

Like insertion, deletion uses proactive fixing — we ensure each node on the path down has enough keys so we never need to backtrack.
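
Minimal sketches of the two fix-ups, reusing the dict-based nodes from the earlier sketches (a full delete routine would apply one of these to child i before descending; illustrative, not production code):

def borrow_from_right(node, i):
    """Rotate a key through the parent from the right sibling."""
    child, right = node["children"][i], node["children"][i + 1]
    child["keys"].append(node["keys"][i])      # separator comes down
    node["keys"][i] = right["keys"].pop(0)     # sibling's first key goes up
    if not right["is_leaf"]:
        child["children"].append(right["children"].pop(0))

def merge_with_right(node, i):
    """Fold the separator and the right sibling into child i."""
    child, right = node["children"][i], node["children"].pop(i + 1)
    child["keys"].append(node["keys"].pop(i))  # separator pulled down
    child["keys"].extend(right["keys"])
    child["children"].extend(right["children"])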

09

Deletion — Merging & Borrowing

Borrowing from a sibling

Before (right sibling has spare keys):

  Parent:  [...| 30 |...]
             /       \
  Child: [10|20]  [35|40|50|60]

After rotation:

  Parent:  [...| 35 |...]
             /       \
  Child: [10|20|30] [40|50|60]

Key 30 demoted, key 35 promoted.

Merging two nodes

Before (both have t-1 keys):

  Parent:  [...| 30 |...]
             /       \
  Child: [10|20]   [35|40]

After merge:

  Parent:  [...]     ← 30 pulled down
             |
  Merged: [10|20|30|35|40]

Tree height reduction: if the parent was the root and had only one key, the merged node becomes the new root — the tree shrinks in height. This is the only way a B-tree gets shorter.

10

B+ Trees

In a B+ tree, all data records live in the leaf nodes. Internal nodes store only keys and child pointers — a pure index.

Feature            | B-tree                         | B+ tree
-------------------|--------------------------------|--------
Data location      | Any node                       | Leaves only
Leaf linking       | None                           | Doubly linked list
Internal node keys | Separators + data              | Separator copies only
Fanout             | Lower (data in internal nodes) | Higher (index-only internals)
Range queries      | Requires tree traversal        | Follow leaf pointers
Point query        | May terminate at internal node | Always reaches a leaf
              [30 | 60]                    ← index only
             /    |    \
   [10|20] <-> [30|35|42] <-> [60|70|80]   ← data in leaves
           leaf-level linked list for sequential scan
11

B+ Tree Range Queries

Algorithm

RANGE-QUERY(tree, low, high):
    leaf = search for leaf containing low
    results = []
    while leaf is not null:
        for each (key, val) in leaf:
            if key > high:
                return results
            if key >= low:
                results.append((key, val))
        leaf = leaf.next_leaf
    return results
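
A runnable sketch of the leaf-level scan (dict leaves with a next pointer; in a real B+ tree the starting leaf is first found by descending the index):

def range_query(first_leaf, low, high):
    results, leaf = [], first_leaf
    while leaf is not None:
        for key, val in zip(leaf["keys"], leaf["vals"]):
            if key > high:
                return results                 # past the range: stop early
            if key >= low:
                results.append((key, val))
        leaf = leaf["next"]                    # follow the leaf linked list
    return results

l2 = {"keys": [60, 70], "vals": ["f", "g"], "next": None}
l1 = {"keys": [30, 35, 42], "vals": ["c", "d", "e"], "next": l2}
print(range_query(l1, 35, 65))   # [(35, 'd'), (42, 'e'), (60, 'f')]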

Cost analysis

Operation            | Cost
---------------------|-----
Find starting leaf   | O(log_t n)
Scan matching leaves | O(k / b)
Total                | O(log_t n + k/b)

k = result size, b = keys per leaf

Why this matters

SELECT * FROM orders
WHERE order_date
  BETWEEN '2025-01-01' AND '2025-03-31'
ORDER BY order_date;
  • B+ tree on order_date: find the first matching leaf, then scan sequentially
  • The leaf linked list provides data already sorted — no additional sort step
  • Adjacent leaves are often physically adjacent on disk — sequential I/O

This is why ORDER BY on an indexed column is essentially free when the query uses that index.

12

B* Trees

The B* tree is a variant in which every non-root node must be at least 2/3 full (vs 1/2 in a standard B-tree).

Property          | B-tree      | B* tree
------------------|-------------|--------
Minimum fill      | 1/2         | 2/3
Split strategy    | Split 1 → 2 | Redistribute, then split 2 → 3
Space utilisation | ~50–69%     | ~66–81%
Average fanout    | High        | Even higher

Two-to-three split

When a node overflows and its sibling is full:

Before:  [A B C D] --E-- [F G H I]   ← both full, separator E in parent

Redistribute across three nodes:

After:   [A B C] --D-- [E F] --G-- [H I]
         two separator keys now sit in the parent

Benefits

  • Better space utilisation — fewer disk pages for the same data
  • Higher effective fanout — shallower trees
  • Fewer splits overall

HFS+ (macOS) historically used B* trees for its catalog file.

13

B-Tree Height Analysis

Height bound: for a B-tree of order m (minimum degree t = ⌈m/2⌉) with n keys:   h ≤ log_t((n + 1) / 2)

Concrete examples

Keys (n)       | Order (m) | Min degree (t) | Max levels | Disk reads
---------------|-----------|----------------|------------|-----------
1,000          | 100       | 50             | 2          | 2
1,000,000      | 100       | 50             | 4          | 4
1,000,000,000  | 100       | 50             | 6          | 6
1,000,000      | 500       | 250            | 3          | 3
1,000,000,000  | 500       | 250            | 4          | 4
1,000,000,000  | 1000      | 500            | 4          | 4
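
Checking the table against the bound (a sketch; h here counts edges from root to leaf, so the level/page-read count above is h + 1):

import math

# Evaluate h <= log_t((n + 1) / 2), then report page reads = h + 1.
def max_height(n: int, t: int) -> int:
    return math.floor(math.log((n + 1) / 2, t))

for n, t in [(10**3, 50), (10**9, 50), (10**9, 500)]:
    print(f"n={n:>13,} t={t:>3} -> reads <= {max_height(n, t) + 1}")
    # 2, 6, and 4 -- matching the table rows above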

B-tree vs binary tree

Balanced BST with 10^9 keys: height ~30 → 30 disk reads.
B-tree of order 1000 with 10^9 keys: height ~3–4 → 3–4 disk reads.

In practice

The root node is almost always cached in memory. So a search in a billion-key B-tree often requires only 2–3 disk reads.

14

Disk I/O Model & Why Fanout Matters

The external memory model

  • Memory holds M elements
  • Disk transfers data in blocks of B elements
  • Cost metric: number of block transfers (I/Os)

Fanout and tree height

Fanout f ≈ page_size / (key_size + ptr_size)

Example: 16 KB page, 8 B keys, 8 B pointers
f = 16384 / 16 = 1024

Height for n = 10^9:
h = ceil(log_1024(10^9))
  = ceil(2.99) = 3

I/O costs comparison

Structure    | Search       | Insert     | Range (k results)
-------------|--------------|------------|------------------
Sorted array | O(log(n/B))  | O(n/B)     | O(log(n/B) + k/B)
Binary tree  | O(log n)     | O(log n)   | O(log n + k)
B-tree / B+  | O(log_B n)   | O(log_B n) | O(log_B n + k/B)
Hash table   | O(1)         | O(1)       | O(n/B)

B-trees achieve optimal I/O for comparison-based search. Only hashing has better point-query I/O — but hashing cannot support range queries or ordering.

15

B-Trees in Databases

InnoDB (MySQL)

  • Default storage engine since MySQL 5.5
  • Clustered index: the primary key B+ tree stores actual row data in leaves — the table is the index
  • Secondary indexes store the PK value in leaves; lookup requires two B+ tree traversals
  • Page size: 16 KB; typical fanout: ~500–1200

PostgreSQL

  • Heap-organised tables (not clustered by default)
  • B-tree is the default index type
  • Uses Lehman-Yao B-trees with right-link pointers for lock-free concurrent reads
  • Page size: 8 KB; typical fanout: ~200–500
  • Supports INCLUDE columns for covering indexes

Other databases

Database   | B-tree variant | Notes
-----------|----------------|------
SQLite     | B+ tree        | Both table and index structures
Oracle     | B+ tree        | Index-organised tables (IOT) option
SQL Server | B+ tree        | Clustered and non-clustered indexes
MongoDB    | B+ tree        | WiredTiger engine, default since 3.2

Clustered vs heap: InnoDB's clustered index means PK range scans are sequential I/O. PostgreSQL's heap means any index scan may require random I/O to fetch heap tuples — mitigated by bitmap scans.

16

B-Trees in File Systems

NTFS (Windows)

  • Master File Table uses B+ trees for directory indexing
  • File names sorted in B+ tree leaves for fast lookups
  • Large directories with thousands of entries remain fast

HFS+ (macOS, legacy)

  • Catalog file is a B* tree mapping file IDs to metadata
  • Extents overflow file is also a B* tree
  • Replaced by APFS in 2017

Btrfs (Linux)

  • "B-tree file system" — name is literal
  • Copy-on-write B-trees: file extents, directory entries, checksums, snapshots
  • COW enables atomic snapshots and transactions

Other file systems

File system | B-tree usage
------------|-------------
XFS         | B+ trees for directory entries, free space, inodes, extents
ReiserFS    | B+ tree (S+ tree) for all metadata and small-file data
ZFS         | DMU with indirect block trees (not classical B-trees)
ext4        | HTree (hash-indexed B-tree variant) for directories

Common theme: file systems need sorted, balanced indexes for directories and metadata — exactly what B-trees provide. The high fanout keeps directory lookups under a handful of I/Os even for millions of files.

17

LSM-Trees vs B-Trees

LSM-trees buffer writes in an in-memory memtable, then flush to immutable sorted runs (SSTables) on disk. Periodic compaction merges runs.
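
A toy sketch of that write path (TinyLSM and its field names are illustrative, not any real engine's API):

class TinyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable, self.limit, self.runs = {}, memtable_limit, []

    def put(self, key, val):
        self.memtable[key] = val                 # buffered in memory, no disk I/O
        if len(self.memtable) >= self.limit:
            run = sorted(self.memtable.items())  # flush: one sequential write
            self.runs.insert(0, run)             # newest immutable run first
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in self.runs:                    # reads may check many runs --
            for k, v in run:                     # the read-latency cost below
                if k == key:
                    return v
        return None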

Property            | B-tree                    | LSM-tree
--------------------|---------------------------|---------
Write pattern       | Random (update in place)  | Sequential (append + compact)
Write amplification | Lower for updates         | Higher (compaction rewrites)
Write throughput    | Moderate                  | High (sequential I/O)
Read latency        | Predictable O(log_B n)    | May check multiple levels
Space amplification | ~50–69% page utilisation  | Multiple copies during compaction
Range queries       | Excellent (leaf scan)     | Good (after compaction)

Prefer B-trees

Read-heavy OLTP, predictable latency, range scans, general-purpose workloads

Prefer LSM-trees

Write-heavy workloads, time-series, logging, append-mostly data

LSM engines

RocksDB · LevelDB · Cassandra · HBase

18

Fractal Trees & Write-Optimised Structures

The write amplification problem

B-trees write an entire page for each modified key. If a page holds 500 keys but only one changed, you still write the full page.

Fractal trees (Buffered B-trees)

Each internal node has a message buffer. Writes are inserted into the buffer at the root and lazily pushed down.

     [30 | 60]
     buffer: [ins(45), del(12)]
    /    |    \
 [...]  [...] [...]
 Messages trickle down over time
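
A toy sketch of the mechanism (field names are illustrative; a real fractal tree applies messages to leaf records and bounds buffer sizes per page):

from bisect import bisect_left

def push(node, msg):                             # msg = ("ins"|"del", key)
    node["buffer"].append(msg)
    if node["children"] and len(node["buffer"]) >= node["cap"]:
        for op, key in node["buffer"]:           # route each message to
            i = bisect_left(node["keys"], key)   # the covering child
            push(node["children"][i], (op, key))
        node["buffer"] = []                      # one flush pays for many writes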
Property            | B-tree     | Fractal tree
--------------------|------------|-------------
Point query         | O(log_B n) | O(log_B n)
Insert              | O(log_B n) | O((log_B n) / B^ε)
Write amplification | Higher     | Significantly lower

Other write-optimised structures

LSM-tree

Batch writes, sequential flush, periodic compaction. Used by RocksDB, Cassandra, HBase.

Bw-tree

Append delta records to pages; consolidate lazily. Used in SQL Server Hekaton and Azure Cosmos DB.

LA-tree

Lazy-adaptive tree combining B-tree and LSM ideas for workload-responsive behaviour.

TokuDB (now archived) used fractal trees. The ideas live on in PerconaFT and influenced modern hybrid designs.

19

Practical Considerations

Choosing page size

  • Larger pages = higher fanout = shallower tree = fewer I/Os
  • Larger pages = more wasted space if nodes are sparse
  • SSD: 4–16 KB pages work well
  • HDD: larger pages amortise seek latency

Bulk loading

Instead of inserting one by one, build bottom-up:

  1. Sort all keys
  2. Fill leaf pages sequentially (100% full)
  3. Build internal pages from leaf separators
  4. Result: compact tree, one sequential write pass
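
A bottom-up sketch under simplified assumptions (fixed fanout, in-memory dicts, internal keys are the minimum key of each child subtree, B+-style):

def bulk_load(keys, fanout=4):
    keys = sorted(keys)                              # step 1: sort once
    # Step 2: pack leaves completely full, left to right.
    level = [{"keys": keys[i:i + fanout], "children": [], "is_leaf": True}
             for i in range(0, len(keys), fanout)]
    mins = [n["keys"][0] for n in level]             # min key under each node
    # Step 3: build each internal level from child subtree minimums.
    while len(level) > 1:
        parents, pmins = [], []
        for i in range(0, len(level), fanout):
            group, gmins = level[i:i + fanout], mins[i:i + fanout]
            parents.append({"keys": gmins[1:], "children": group,
                            "is_leaf": False})
            pmins.append(gmins[0])
        level, mins = parents, pmins
    return level[0]

root = bulk_load(range(32))
print(root["keys"])   # [16] -- single separator at the root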

CREATE INDEX in PostgreSQL and InnoDB use optimised bulk-loading.

Concurrency control

Approach      | Description
--------------|------------
Lock coupling | Lock parent, lock child, release parent
B-link trees  | Right-link pointers; readers move right without locks during splits
Optimistic    | Read without locks; validate before writing
OLFIT         | Version counters on nodes detect concurrent modifications
20

Summary & Further Reading

Key takeaways

  • B-trees are the dominant on-disk index structure because they minimise disk I/O through high fanout and guaranteed balance
  • Order-1000 B-tree can index a billion keys in 3–4 levels
  • B+ trees keep all data in leaves with a linked list — range queries and sequential scans are fast
  • Proactive splitting/fix-up enable single-pass, top-down algorithms
  • B-trees power indexes in virtually every relational database and many file systems
  • LSM-trees trade read performance for write throughput; fractal trees reduce write amplification
  • The right choice depends on workload: read-heavy → B-trees, write-heavy → LSM-trees

Recommended reading

Source            | Description
------------------|------------
Cormen et al.     | CLRS, Chapter 18: B-Trees
Graefe            | "Modern B-Tree Techniques" (2011 survey)
Kleppmann         | DDIA, Chapter 3: Storage and Retrieval
Bayer & McCreight | "Organization and Maintenance of Large Ordered Indexes" (1970)
Lehman & Yao      | "Efficient Locking for Concurrent Operations on B-Trees" (1981)
CMU 15-445        | Andy Pavlo's Database Systems course, storage & indexing lectures