On a 13-stage pipe with 4-wide fetch, a mispredict costs ~13 bubbles × 4 = 52 lost slots. A 95% accurate predictor on a branchy workload means 5% × 52 = 2.6 lost slots per branch — so predictor quality is often the #1 lever for big-core IPC.
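To turn that into an IPC effect, here is a minimal C sketch of the same arithmetic (the one-branch-per-five-instructions density is an assumed illustrative figure, not from the slide above):

    /* Back-of-envelope mispredict cost, from the numbers above. */
    #include <stdio.h>

    int main(void) {
        double stages = 13.0, width = 4.0;       /* redirect depth, fetch width  */
        double slots_per_miss = stages * width;  /* ~52 lost issue slots         */
        double miss_rate = 0.05;                 /* 95%-accurate predictor       */
        double per_branch = miss_rate * slots_per_miss;   /* ~2.6 slots          */
        double branch_density = 1.0 / 5.0;       /* ASSUMED: 1 branch / 5 instrs */
        printf("lost slots: %.1f per branch, %.2f per instruction\n",
               per_branch, per_branch * branch_density);
        return 0;
    }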
TAGE in 60 seconds
Multiple tagged tables with geometrically increasing history lengths, each indexed by a hash of (PC, branch history). The longest-history table that hits wins; short-history and bimodal fallback tables cover easier branches. Excellent at irregular patterns.
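A minimal lookup sketch in C, with placeholder hashes and table sizes (a simplification, not Seznec's exact algorithm; update logic omitted):

    #include <stdint.h>
    #include <stdbool.h>

    #define NTABLES 4
    #define ENTRIES 1024

    struct entry { uint16_t tag; int8_t ctr; };   /* ctr >= 0 => predict taken */
    static struct entry tables[NTABLES][ENTRIES];
    static int8_t bimodal[ENTRIES];               /* base fallback predictor   */
    static const int hist_len[NTABLES] = {8, 16, 32, 64};  /* geometric-ish    */

    static uint32_t fold(uint64_t hist, int len, uint64_t pc) {
        /* placeholder hash: fold 'len' history bits into the PC */
        uint64_t h = hist & ((len < 64) ? ((1ULL << len) - 1) : ~0ULL);
        return (uint32_t)(pc ^ h ^ (h >> 13));
    }

    bool predict(uint64_t pc, uint64_t hist) {
        for (int t = NTABLES - 1; t >= 0; t--) {  /* longest history first */
            uint32_t h = fold(hist, hist_len[t], pc);
            struct entry *e = &tables[t][h % ENTRIES];
            if (e->tag == (uint16_t)(h >> 10))    /* tag hit: this table wins */
                return e->ctr >= 0;
        }
        return bimodal[pc % ENTRIES] >= 0;        /* no hit: bimodal base */
    }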
05
big.LITTLE Origins (2011-2017)
2011: Cortex-A15 + A7 pairing — first big.LITTLE. A15 was a power-hog OoO, A7 a tiny in-order.
Coherency glue: CCI-400 (ACE-based Cache Coherent Interconnect) kept the two clusters' L2 caches coherent when threads migrated.
Three operation modes:
Cluster Migration (CM) — all cores in one cluster at a time
CPU Migration (IKS, Linaro's In-Kernel Switcher) — pair each A15 with an A7; only one of the pair runs at a time
Global Task Scheduling (GTS) — Linux sees all cores; scheduler migrates freely
Problems: migration latency (>100 µs), OS scheduler bugs, two fixed clusters ⇒ rigid SKUs. Apple never adopted big.LITTLE, going straight to its own heterogeneous designs instead.
big + LITTLE pair | Years
A15 + A7 | 2011-14
A57 + A53 | 2014-16
A72 + A53 | 2015-16
A73 + A53 | 2016-17
Why this had to end
Phones wanted 3 or 4 performance tiers (ultra-big / big / mid / little), not 2. Two fixed clusters don't scale. Hence: DynamIQ.
06
DynamIQ Shared Unit (DSU) — 2017 onwards
Launched with Cortex-A75 / A55 in 2017. Replaces big.LITTLE's two-cluster model.
One cluster containing up to 8 (DSU-110) or 14 (DSU-120) cores of mixed type.
Integrates:
Snoop filter + SCU (coherence)
L3 cache — shared across the cluster; up to 16 MB on DSU-110, 32 MB on DSU-120
Asynchronous bridges to the SoC fabric (CMN mesh, NIC, GIC)
Per-core DVFS domains — each core has its own clock + voltage
Typical 2024 flagship cluster: 1 × Cortex-X925 + 3 × A725 + 4 × A520 on a DSU-120.
Latency between any two cores' L1 is one DSU snoop — dramatically lower than CCI-400 inter-cluster latency.
DSU version | Year | Max cores | Max L3 | Shipping in
DSU (v0) | 2017 | 8 | 4 MB | Kirin 970, Snapdragon 845
DSU-110 | 2021 | 8 | 16 MB | Snapdragon 8 Gen 1/2
DSU-120 | 2023 | 14 | 32 MB | Dimensity 9300, Cortex-X4 platform
DSU-120AE | 2024 | 14 | 32 MB | Automotive (ASIL-B/D)
CHI egress
DSU speaks AMBA 5 CHI out to the CMN mesh / NIC-700. That's the clean interface to the rest of the SoC.
07
Thermal & Power — Why 1+3+4
Modern phone SoCs have ~5-7 W sustained power budgets. Bursts to 12 W cause thermal throttling within ~30 s.
Benchmarks (Geekbench, AnTuTu) reward single-thread peaks → X-cores are for sub-second bursts.
Background work (notifications, network, music) runs on little cores at a fraction of the energy per unit of work.
Task placement happens through Android's EAS (Energy Aware Scheduling), which models each core's capacity and energy curve (see the sketch below).
Thermal headroom is the battleground for next-generation designs. The Dimensity 9300 dropped little cores entirely (4 × X4 + 4 × A720) to maximise peak, and was criticised for sustained battery life.
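A conceptual C sketch of the EAS decision mentioned above (struct fields and numbers are illustrative, not the kernel's actual data structures): for each candidate CPU, pick the lowest operating point that fits the load, then estimate energy from its power curve.

    #include <stdio.h>

    struct opp { double capacity; double power_w; };   /* one perf point */

    /* opps sorted by ascending capacity */
    static double energy_est(const struct opp *opps, int n, double util) {
        for (int i = 0; i < n; i++)
            if (opps[i].capacity >= util)              /* lowest OPP that fits */
                return opps[i].power_w * (util / opps[i].capacity);
        return opps[n - 1].power_w;                    /* saturated: max OPP */
    }

    int main(void) {
        struct opp big[]    = { {300, 0.40}, {600, 1.10}, {1024, 3.00} };
        struct opp little[] = { {120, 0.05}, {260, 0.15}, { 400, 0.40} };
        double util = 200;                             /* background-ish load */
        printf("big: %.2f W  little: %.2f W\n",
               energy_est(big, 3, util), energy_est(little, 3, util));
        return 0;                                      /* little wins here */
    }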
DVFS = dynamic voltage and frequency scaling
Per-core DVFS lets the SoC run one X at 3.8 GHz while four A520s sit at 1.0 GHz. Huge energy win versus the old "all cores same freq" models.
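A worked example of why this wins, assuming dynamic power ≈ C·V²·f with illustrative (not measured) voltage/frequency points:

    #include <stdio.h>

    int main(void) {
        double f_hi = 3.8, v_hi = 1.00;   /* X-core at peak (GHz, V)       */
        double f_lo = 1.0, v_lo = 0.60;   /* A520-class low point (assumed) */
        /* relative dynamic power at the same switched capacitance C: */
        double rel = (v_lo * v_lo * f_lo) / (v_hi * v_hi * f_hi);
        printf("low point burns ~%.0f%% of peak dynamic power\n", rel * 100);
        return 0;   /* ~9% here: one shared V/f domain would waste this */
    }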
AMU — Activity Monitors (v8.4)
Architectural counters the OS can read to estimate perf/watt without proprietary telemetry. Feeds the Linux scheduler's frequency-invariance logic and EAS decisions.
v8.6 adds AMU virtualization (AMUv1.1): virtual counter offsets so a guest can have its own view of the activity monitors.
09
SPE — Statistical Profiling Extension (v8.2)
PMU counts events in aggregate. SPE samples individual executed operations — like Intel PEBS, AMD IBS.
Every Nth retired µop writes a sample record to a dedicated memory buffer:
Instruction PC + virtual address of any memory reference
Latency of that operation
Outcome (hit / miss / TLB miss / fault / mispredict)
Enables:
Accurate per-instruction attribution of cache misses, mispredicts, TLB misses
Cycle counting histograms per source line
Data-reference profiling for memory layout optimisation
Linux perf record -e arm_spe// harvests the buffer.
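For example, to sample only loads above a latency threshold (format fields from the kernel's arm_spe PMU driver; exact support varies by kernel version and hardware):

    perf record -e arm_spe/load_filter=1,min_latency=64/ -- ./app
    perf report --mem-mode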
Why this matters in HPC/ML
Knowing "this kernel has 5% L3 miss rate" is nearly useless. Knowing "x[i][j] at matmul.c:47 causes 90% of L3 misses and costs 200 cycles each" is actionable. SPE closes that gap on Arm.
ETE / TRBE (v8.4 / v9)
ETE (Embedded Trace Extension) succeeds ETM; TRBE (Trace Buffer Extension) stores the program trace in a buffer in DRAM. Real-time and non-invasive; used for post-mortem kernel crash analysis.
10
L1 / L2 / L3 Cache Sizes Over Generations
Core | L1-I / L1-D | L2 | L3 (DSU)
A8 | 32 KB / 32 KB | 0-1 MB, shared | -
A15 | 32 KB / 32 KB | cluster-shared, up to 4 MB | -
A57 | 48 KB / 32 KB | cluster-shared, 1-2 MB | -
A72 | 48 KB / 32 KB | cluster-shared, 1-2 MB | -
A76 | 64 KB / 64 KB | 256-512 KB private | up to 4 MB
A78 | 64 KB / 64 KB | 256-512 KB private | up to 8 MB
X1 | 64 KB / 64 KB | up to 1 MB private | up to 8 MB
X4 / A720 | 64 KB / 64 KB | 2 MB / 512 KB-1 MB private | up to 32 MB (DSU-120)
X925 / A725 | 64 KB / 64 KB | 3 MB / 1 MB private | up to 32 MB
Two big shifts: (1) 2018 — L2 became private, matching the introduction of the shared L3 in the DSU. (2) 2023 — L2 exploded to 2-3 MB per X-core, approaching laptop-class sizes.
11
Load/Store Unit — the Quiet Bottleneck
Modern big Cortex-A: 2 load + 2 store pipes (X1+ has 3 load).
Store buffer: 40-60 entries. Coalesces consecutive stores to the same line before committing to L1.
Memory dependency prediction — speculates whether a load depends on an earlier store in the queue. Mispredict → replay the load.
DSB stalls — when a real DSB drains the store queue, it blocks the pipeline for tens of cycles. One reason release/acquire ordering is preferred over full barriers; see the sketch after this list.
Atomic fast-path: LSE atomics (CAS, LDADD) bypass the store queue and go straight to the cache controller with a near-atomic request.
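A minimal C11 sketch of both fast paths. On AArch64 built with -march=armv8.1-a (LSE), the fetch_add typically compiles to LDADDAL, and the release/acquire pair to STLR/LDAR rather than full DMB barriers:

    #include <stdatomic.h>
    #include <stdbool.h>

    atomic_int counter;
    atomic_bool ready;
    int payload;

    void producer(void) {
        payload = 42;
        /* release store: compiles to STLR, no DMB needed */
        atomic_store_explicit(&ready, true, memory_order_release);
        /* LSE fast path: compiles to LDADDAL with -march=armv8.1-a */
        atomic_fetch_add_explicit(&counter, 1, memory_order_acq_rel);
    }

    int consumer(void) {
        /* acquire load: compiles to LDAR */
        while (!atomic_load_explicit(&ready, memory_order_acquire))
            ;
        return payload;   /* guaranteed to observe 42 */
    }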
DC ZVA — the allocate-without-read trick
Allocates a cache line in L1 without first filling it from DRAM, then writes zeros to the whole line. Used in memset(0), calloc, and page zeroing. Often 3-5× faster than a STR XZR loop.
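A hedged sketch of the trick with inline assembly (assumes Linux's default of leaving SCTLR_EL1.DZE enabled so EL0 may execute DC ZVA, and a buffer aligned to the ZVA block size):

    #include <stdint.h>
    #include <stddef.h>

    /* DCZID_EL0 is readable at EL0; bits [3:0] give log2(block size in
     * 4-byte words). Bit 4 (DZP) set would mean DC ZVA is prohibited;
     * not checked in this sketch. */
    static inline size_t zva_block_size(void) {
        uint64_t dczid;
        __asm__ volatile("mrs %0, dczid_el0" : "=r"(dczid));
        return (size_t)4 << (dczid & 0xF);   /* typically 64 bytes */
    }

    /* Zero a block-aligned, block-multiple region without reading DRAM. */
    static void zero_lines(char *p, size_t bytes) {
        size_t bs = zva_block_size();
        for (char *end = p + bytes; p < end; p += bs)
            __asm__ volatile("dc zva, %0" :: "r"(p) : "memory");
    }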
STP pair-ops
STP Xn, Xm, [sp,#-16]! writes 16 bytes in a single instruction. This is why every function prologue/epilogue on AArch64 uses STP/LDP: double the store width per instruction.
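A typical non-leaf prologue/epilogue, as GCC/Clang emit it:

    stp x29, x30, [sp, #-16]!   // push frame pointer + link register in one store
    mov x29, sp                 // establish the frame
    // ... function body ...
    ldp x29, x30, [sp], #16     // pop both in one load, post-indexed
    ret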
Sharing microarchitecture between Cortex-A and Neoverse amortises R&D across the mobile and server markets: over 100 M Arm-based phones ship per quarter and over 10 M server cores per year off the same design dollars.
Custom licensees
Apple, Qualcomm (Nuvia/Oryon), and Ampere (AmpereOne) all run the Arm ISA but implement their own microarchitectures; none use Cortex-A RTL. AWS Graviton 4, by contrast, licenses Arm's Neoverse V2 cores rather than Cortex-A.
14
Series Finale — Putting It All Together
Across the 6 decks:
01 History — why the family looks the way it does
02 Architecture — ELs, A64, system regs
03 Memory — VMSA, caches, ordering, atomics
04 Vectors — NEON → SVE → SVE2 → SME
05 Security — TrustZone, EL2, PAC/BTI/MTE, CCA
06 Microarchitecture — OoO, DynamIQ, PMU
Together they cover what a phone/SoC/kernel Arm engineer is expected to know — and what a Neoverse engineer needs before deck 01 of that series.
What I'd study next
AMBA series — everything below the core (AXI, ACE, CHI)
Neoverse series — same cores, server deployment
Arm System IP series — GIC, SMMU, DSU, MPAM
Modern SoC Design series — packaging, chiplets, SerDes, CXL
15
Lessons
"Why did A76 shorten the pipeline vs A72?" → higher IPC + lower mispredict cost + 7 nm process allowed physical designs with fewer stages at same clock.
"Explain DynamIQ vs big.LITTLE" → one cluster, many core types, shared L3, per-core DVFS; vs two fixed clusters, separate L2s, CCI-400 inter-cluster coherence.
"Why is RSB important?" → returns are indirect but deterministic — a small RSB predicts them perfectly. Attackers have exploited RSB underflows for Spectre-RSB variants.
"What's the difference between PMU and SPE?" → PMU counts events in aggregate; SPE samples individual operations with per-op latency & attribution.
"Why 1+3+4?" → thermal + benchmark + efficiency blend. 1 X for burst benchmarks, 3 mid for sustained, 4 little for background — fits a 5-7 W budget.
"Why does Arm keep making L3 bigger?" → DRAM bandwidth per core didn't grow at Moore rate. Larger L3 hides more misses. X925 + DSU-120 pushes toward laptop-sized LLCs.
"DSU vs CMN?" → DSU is a cluster-level shared L3 + snoop filter; CMN is a system-level mesh that connects multiple DSUs + I/O + memory controllers.
Where this is heading
SVE2 as default vectors — NEON in compat mode only.
SME on flagships — matrix units as standard for on-device LLM.
Memory tagging universal — MTE async in every Armv9-A shipping kernel.
Laptop-class Cortex-A — Windows-on-Arm + Android XR + Chromebook share the same X-core die budgets.
CCA on phones — protected media + on-device AI inference in Realm world.
The bar for an Arm hire
A serious Arm microarchitecture review covers: branch prediction, OoO recovery, memory ordering, cache coherence, PMU analysis, power/thermal modelling, and at least awareness of PAC/BTI/MTE/CCA. The six decks in this series are the map.
17
References
Arm Ltd. — per-core Technical Reference Manuals (A76, A78, X1, X3, X4, A520, A725, X925), publicly available
Arm Ltd. — DynamIQ Shared Unit TRMs (DSU-110, DSU-120): cache hierarchy, coherence, CHI egress
Arm Ltd. — Arm PMU Architecture Reference Manual (DDI 0601): PMU + SPE + AMU
Shen & Lipasti — Modern Processor Design (McGraw-Hill, 2013): OoO pipeline fundamentals
Seznec & Michaud — "A case for (partially) TAgged GEometric history length branch prediction" (JILP, 2006)
Chipsandcheese.com — microarchitecture deep-dives on Neoverse V2, Cortex-X4, Apple M-series
AnandTech — Andrei Frumusanu archive (2015-2023): per-release benchmark + IPC analysis
Hot Chips / ISSCC proceedings — annual Cortex-X and Neoverse architecture talks
Linux kernel — arch/arm64/: canonical consumer of every A-profile PMU / cache / memory feature
Presentation built with Reveal.js 4.6 · Playfair Display + DM Sans + JetBrains Mono
Educational use.