ARM CORTEX-A · PRESENTATION 06

Microarchitecture — OoO, DynamIQ, PMU

From A8's in-order pipeline to X925's 10-wide OoO beast
Fetch · Decode · Rename · ROB · Issue · L/S · Branch Prediction · DynamIQ DSU · PMU · SPE · TRBE
02

From In-Order to Out-of-Order

Core         Decode       Issue            Stages  Comment
Cortex-A8    2            2 in-order       13      Baseline, no OoO
Cortex-A9    2            4 OoO (partial)  8-11    First Arm with register rename
Cortex-A15   3            8 OoO            15      First "big" OoO
Cortex-A57   3            8 OoO            15-17   First AArch64 big core
Cortex-A72   3            8 OoO            15      Refined A57, lower power
Cortex-A76   4            8 OoO            13      Shorter pipeline, higher IPC
Cortex-A78   4            8 OoO            13      Power-tuned A77
Cortex-X1    5            8 OoO            13      First "Custom Core"; wider front-end
Cortex-X4    10 dispatch  10+ OoO          ~15     Largest ROB to date (~384 entries)
Cortex-X925  10           10+ OoO          15+     2024 "Blackhawk" — flagship width

Two trends over 20 years: wider front-ends (2 → 10 decode) and shorter pipelines (A15's 15 stages → A76's 13). Width raises IPC; a shorter pipe cuts the mispredict penalty, which helps branchy code.

03

Canonical OoO Pipeline (Cortex-A78 / X1 class)

[Schematic] Cortex-A76/A78 pipeline:
Fetch (4-wide) + predict (BTB/RSB/BHT) → Decode (4-wide) → Rename/Dispatch (~200-entry ROB) → issue queue per unit → 2× ALU + MAC + DIV · 2× FP/SIMD (128-bit) · 2× Load/Store · Branch unit → Commit (in-order)
L1-I 64 KB · BTB ~6000 entries · RSB 16 · Tournament + TAGE-SC · L1-D 64 KB · L2 512 KB-1 MB private

The "big" Cortex-A pattern: 4-wide fetch/decode/rename, 8-wide issue across heterogeneous EUs, 2-3 load + store pipes, in-order commit. X-class widens this to 10-wide decode + 10+ issue.

04

Branch Prediction — The Other Half of IPC

  • BTB (Branch Target Buffer) — direct branches. Modern cores: 2-level, 2k-4k entries L0 + 16k+ entries L1.
  • Indirect predictor — for BR Xn. Smaller (hundreds of entries) but often path-hashed for vtable dispatch.
  • RSB (Return Stack Buffer) — predicts return targets. 16 entries typical. Overflows to BTB.
  • Conditional predictor — evolved from 2-bit counters → tournament → TAGE / TAGE-SC-L. State-of-the-art as of Cortex-X4/X925.
  • Perceptron predictor — a weighted sum of history bits; used in some non-Arm designs (AMD's Zen family) and rumoured in Apple's cores.
  • Flow: fetch block ⇒ predict targets + directions ⇒ redirect fetch same cycle. The latency to recover from a wrong prediction is the mispredict penalty.

The mispredict cost

On a 13-stage pipe with 4-wide fetch, a mispredict costs ~13 bubbles × 4 = 52 lost slots. A 95% accurate predictor on a branchy workload means 5% × 52 = 2.6 lost slots per branch — so predictor quality is often the #1 lever for big-core IPC.
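
A minimal sketch of that arithmetic (the helper name is ours; the numbers are the example above):

// Lost issue slots per executed branch.
// lost_slots_per_branch(13, 4, 0.95) == 2.6
static double lost_slots_per_branch(int pipe_depth, int fetch_width,
                                    double accuracy) {
    return (1.0 - accuracy) * pipe_depth * fetch_width;
}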

TAGE in 60s

Multiple "tagged geometric" history-length tables index by a hash of (PC, long branch history). Longest table that hits wins. Short-history & bi-modal fall-back tables cover easier branches. Excellent at irregular patterns.

05

big.LITTLE Origins (2011-2017)

  • 2011: Cortex-A15 + A7 pairing — first big.LITTLE. A15 was a power-hog OoO, A7 a tiny in-order.
  • Coherency glue: CCI-400 (ACE-based Cache Coherent Interconnect) kept the two clusters' L2 caches coherent when threads migrated.
  • Three operation modes:
    • Cluster Migration (CM) — all cores in one cluster at a time
    • CPU Migration (IKS) — pair each A15 with an A7, only one runs
    • Global Task Scheduling (GTS) — Linux sees all cores; scheduler migrates freely
  • Problems: migration latency (>100 µs), OS scheduler bugs, two fixed clusters ⇒ rigid SKUs. iOS never adopted big.LITTLE — Apple went straight to a single-cluster heterogeneous design.
big + LITTLE pair  Years
A15 + A7           2011-14
A57 + A53          2014-16
A72 + A53          2015-16
A73 + A53          2016-17

Why this had to end

Phones wanted 3 or 4 performance tiers (ultra-big / big / mid / little), not 2. Two fixed clusters don't scale. Hence: DynamIQ.

06

DynamIQ Shared Unit (DSU) — 2017 onwards

  • Launched with Cortex-A75 / A55 in 2017. Replaces big.LITTLE's two-cluster model.
  • One cluster containing up to 8 (DSU-110) or 14 (DSU-120) cores of mixed type.
  • Integrates:
    • Snoop filter + SCU (coherence)
    • L3 cache — shared across the cluster, up to 16 MB (DSU-110) / 32 MB (DSU-120)
    • Asynchronous bridges to the SoC fabric (CMN mesh, NIC, GIC)
    • Per-core DVFS domains — each core has its own clock + voltage
  • Typical 2024 flagship cluster: 1 × Cortex-X925 + 3 × A725 + 4 × A520 on a DSU-120.
  • Latency between any two cores' L1 is one DSU snoop — dramatically lower than CCI-400 inter-cluster latency.
DSU version  Year  Max cores  Max L3  Shipping in
DSU (v0)     2017  8          4 MB    Kirin 970, Snapdragon 845
DSU-110      2021  8          16 MB   Snapdragon 8 Gen 1/2
DSU-120      2023  14         32 MB   Dimensity 9300, Cortex-X4 platform
DSU-120AE    2024  14         32 MB   Automotive — ASIL-B/D

CHI egress

DSU speaks AMBA 5 CHI out to the CMN mesh / NIC-700. That's the clean interface to the rest of the SoC.

07

Thermal & Power — Why 1+3+4

  • Modern phone SoCs have ~5-7 W sustained power budgets. Bursts to 12 W cause thermal throttling within ~30 s.
  • Benchmarks (Geekbench, Antutu) run single-thread peaks → X-cores are for < 1-second bursts.
  • Day-to-day browsing & app scrolling uses 1-3 mid cores — sustained efficiency zone.
  • Background work (notifications, network, music) runs on little cores at a fraction of the energy per unit of work.
  • The mapping happens through Android's EAS (Energy Aware Scheduling), which models each core's capacity and energy curve — a toy sketch follows this list.
  • Thermal headroom is where the next-generation fights are happening. Dimensity 9300 dropped little cores entirely (4 × X4 + 4 × A720) to maximise peak, and drew criticism over sustained battery life.
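
A toy version of the EAS placement idea. The capacity/power numbers are invented for illustration, and the kernel's real energy model integrates per-OPP tables rather than a single operating point:

typedef struct { const char *name; double capacity; double watts; } core_t;

static const core_t cores[] = {
    { "big (X925)",    1024.0, 4.0 },   // invented numbers, not real tables
    { "mid (A725)",     640.0, 1.2 },
    { "little (A520)",  256.0, 0.3 },
};

// Pick the lowest energy-per-work core that still fits the task.
int pick_core(double util /* task utilisation, in capacity units */) {
    int best = -1;
    double best_epw = 1e30;
    for (int i = 0; i < 3; i++) {
        if (cores[i].capacity < util) continue;     // core too small
        double epw = cores[i].watts / cores[i].capacity;
        if (epw < best_epw) { best_epw = epw; best = i; }
    }
    return best;   // -1 ⇒ nothing fits; real EAS falls back to load balancing
}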

DVFS = dynamic voltage / frequency scaling

Per-core DVFS lets the SoC run one X at 3.8 GHz while four A520s sit at 1.0 GHz. Huge energy win versus the old "all cores same freq" models.

AMU — Activity Monitors (v8.4)

Architectural counters the OS can read to estimate perf/watt without proprietary telemetry. Feeds Linux uclamp and EAS decisions.

08

PMU — Performance Monitoring

  • Armv8-A PMU: 1 cycle counter (PMCCNTR_EL0) + programmable event counters (PMEVCNTRn_EL0) per PE — typically 6, architecturally up to 31.
  • Events selected via PMEVTYPERn_EL0 — hundreds of events (cache misses, BP mispredicts, stalls, retired instructions).
  • Common events:
    • CPU_CYCLES, INST_RETIRED
    • L1D_CACHE / L1D_CACHE_REFILL
    • BR_MIS_PRED_RETIRED
    • STALL_FRONTEND / STALL_BACKEND
    • Topdown Armv8.8-A events (BAD_SPEC etc.) for Intel-style topdown analysis
  • Linux perf and Android simpleperf both consume PMU directly.
// Program PMU counter 0 = L1 D-cache miss
// (assumes PMCR_EL0.E is set and EL0 access is enabled via PMUSERENR_EL0)
msr   pmselr_el0, xzr            // select counter 0
mov   x0, #0x03                  // event 0x03 = L1D_CACHE_REFILL
msr   pmxevtyper_el0, x0         // MSR needs a register, not an immediate
mov   x1, #1
msr   pmcntenset_el0, x1         // enable counter 0

// Run workload...

msr   pmcntenclr_el0, x1         // disable counter 0
mrs   x0, pmxevcntr_el0          // read count

// Linux equivalent
// $ perf stat -e l1d_cache_refill ./bench
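
The same count from user space, as a minimal sketch around the Linux perf_event_open syscall (raw event 0x03 = L1D_CACHE_REFILL; error handling trimmed):

#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_RAW;
    attr.size = sizeof(attr);
    attr.config = 0x03;                  // raw event: L1D_CACHE_REFILL
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    int fd = (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    // ... run workload here ...
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t count;
    read(fd, &count, sizeof(count));
    printf("L1D refills: %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}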

For virtualization, MDCR_EL2.HPMN partitions the counters so a guest can be given its own PMU view.

09

SPE — Statistical Profiling Extension (v8.2)

  • PMU counts events in aggregate. SPE samples individual executed operations — like Intel PEBS, AMD IBS.
  • Every Nth retired µop writes a sample record to a dedicated memory buffer:
    • Instruction PC + virtual address of any memory reference
    • Latency of that operation
    • Outcome (hit / miss / TLB miss / fault / mispredict)
  • Enables:
    • Accurate per-instruction attribution of cache misses, mispredicts, TLB misses
    • Cycle counting histograms per source line
    • Data-reference profiling for memory layout optimisation
  • Linux perf record -e arm_spe_0// harvests the buffer — usage sketch below.
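
Typical usage, assuming kernel + hardware support (the PMU instance name arm_spe_0 and the format flags ts_enable / load_filter / min_latency come from the upstream driver; support varies by kernel):

// Sample loads with latency ≥ 50 cycles, timestamps on
// $ perf record -e arm_spe_0/ts_enable=1,load_filter=1,min_latency=50/ -- ./bench
// $ perf report --stdio     // per-op PC, data address, latency, hit/miss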

Why this matters in HPC/ML

Knowing "this kernel has 5% L3 miss rate" is nearly useless. Knowing "x[i][j] at matmul.c:47 causes 90% of L3 misses and costs 200 cycles each" is actionable. SPE closes that gap on Arm.

ETE / TRBE (v8.4 / v9)

ETE is the successor to ETM as the trace source; TRBE lands the program trace in a buffer in DRAM. Real-time, non-invasive. Used for post-mortem kernel crash analysis.

10

L1 / L2 / L3 Cache Sizes Over Generations

Core         L1-I / L1-D    L2                          L3 (DSU)
A8           32 KB / 32 KB  0-1 MB shared               -
A15          32 KB / 32 KB  cluster-shared, up to 4 MB  -
A57          48 KB / 32 KB  cluster-shared 1-2 MB       -
A72          48 KB / 32 KB  cluster-shared 1-2 MB       -
A76          64 KB / 64 KB  256-512 KB private          up to 4 MB
A78          64 KB / 64 KB  256-512 KB private          up to 8 MB
X1           64 KB / 64 KB  up to 1 MB private          up to 8 MB
X4 / A720    64 KB / 64 KB  2 MB / 512 KB-1 MB          up to 32 MB (DSU-120)
X925 / A725  64 KB / 64 KB  3 MB / 1 MB                 up to 32 MB

Two big shifts: (1) 2018 — L2 became private, matched with introduction of the shared L3 in DSU. (2) 2023 — L2 exploded to 2-3 MB per X-core, approaching laptop-class sizes.

11

Load/Store Unit — the Quiet Bottleneck

  • Modern big Cortex-A: 2 load + 2 store pipes (X1+ has 3 load).
  • Store buffer: 40-60 entries. Coalesces consecutive stores to same line before committing to L1.
  • Memory dependency prediction — speculates whether a load depends on an earlier store in the queue. Mispredict → replay the load.
  • DSB stalls — a real DSB drains the store buffer and blocks the pipeline for tens of cycles. One reason release/acquire ordering is preferred over full barriers.
  • Atomic fast-path: LSE atomics (CAS, LDADD) can be sent past the store queue straight to the cache controller as "far atomic" requests — see the C snippet below.
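
From C, with LSE available (-march=armv8.1-a, or selected at runtime via -moutline-atomics), the fetch-add below compiles to a single LDADD rather than an LDXR/STXR retry loop:

#include <stdatomic.h>

// One LDADD instruction on LSE hardware; no load-exclusive retry loop.
long counter_add(_Atomic long *ctr, long v) {
    return atomic_fetch_add_explicit(ctr, v, memory_order_relaxed);
}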

DC ZVA — the allocate-without-read trick

Allocates a cache line in L1 without an initial fill from DRAM, then zeroes the whole line. Used in memset(0), calloc, page zeroing. Often 3-5× faster than a loop of STR XZR stores.
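
A minimal sketch, assuming a 64-byte zero granule and a line-aligned buffer — production code must read the real block size from DCZID_EL0 and check that ZVA is not prohibited:

#include <stddef.h>
#include <stdint.h>

// Zero `len` bytes starting at `buf` (64-byte aligned, len % 64 == 0).
void zero_with_dc_zva(void *buf, size_t len) {
    for (uintptr_t p = (uintptr_t)buf; p < (uintptr_t)buf + len; p += 64)
        __asm__ volatile("dc zva, %0" : : "r"(p) : "memory");
}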

STP pair-ops

STP Xn, Xm, [sp, #-16]! is a single instruction that writes 16 bytes — double the store width. That's why every function prologue/epilogue on AArch64 uses STP/LDP:
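
// The canonical compiler-emitted frame save/restore
stp   x29, x30, [sp, #-16]!   // push frame pointer + link register in one store
mov   x29, sp                 // establish the new frame
// ... function body ...
ldp   x29, x30, [sp], #16     // restore both with one load
ret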

12

What the Compiler Sees — ACLE & Attributes

  • Tuning switches — compiler picks instruction scheduling & feature subsets per core:
    • -mcpu=cortex-x4
    • -mcpu=neoverse-v2
    • -march=armv9-a+sve2+bf16
  • Function-level hints:
    • __attribute__((target("+sve"))) — per-function code-gen target
    • __builtin_prefetch(p, rw, locality) → PRFM
    • MTE intrinsics from ACLE's arm_acle.h (e.g. __arm_mte_create_random_tag)
  • Pragma-level: #pragma GCC ivdep, #pragma clang loop vectorize(enable).
// Compiler-targeted optimisation

// Force SVE code-gen for this function (clang/GCC on AArch64)
__attribute__((target("+sve")))
float sve_dot(const float *a, const float *b, size_t n) {
  float acc = 0.0f;
  for (size_t i = 0; i < n; i++)  // compiler can emit a VLA SVE loop here
    acc += a[i] * b[i];           // (FP reduction needs -O3 -ffast-math)
  return acc;
}

// Prefetch hint
for (size_t i = 0; i < n; i += 64) {
  __builtin_prefetch(&arr[i + 128], 0, 0); // read, streaming → PRFM pldl1strm
  process(&arr[i]);
}

// Runtime dispatch (outline-atomics style): pick a kernel once at startup
typedef float (*dot_fn)(const float *, const float *, size_t);
dot_fn impl = runtime_has_sve() ? sve_kernel : neon_kernel;
13

Neoverse Heritage — Cortex-A → Server

  • Every Neoverse core is a re-characterised Cortex-A:
    • Neoverse N1 = Cortex-A76 (2018)
    • Neoverse V1 = Cortex-X1 + 2×256-bit SVE (2021)
    • Neoverse N2 = Cortex-A710 (v9-A)
    • Neoverse V2 = Cortex-X3 + 4×128-bit SVE (2022)
    • Neoverse V3 / N3 (2024) — shares front-end DNA with X4 / A720
  • Server-side changes:
    • Dedicated CMN mesh (not DSU)
    • RAS, MPAM, SBSA compliance
    • More L2 + SLC, wider memory I/F (DDR5 / HBM)
  • See the Neoverse Presentation Series for the server story.

Why a common core

Sharing microarchitecture between Cortex-A and Neoverse amortises R&D across the mobile + server markets — partners ship > 100 M Arm-based phones per quarter and > 10 M server cores per year off the same design dollars.

Custom licensees

Apple, Qualcomm (Nuvia-derived Oryon), and Ampere (AmpereOne) run the Arm ISA but implement their own microarchitecture — none of them use Cortex-A RTL. AWS Graviton 4, by contrast, is built on Neoverse V2 cores.

14

Series Finale — Putting It All Together

  • Across the 6 decks:
    • 01 History — why the family looks the way it does
    • 02 Architecture — ELs, A64, system regs
    • 03 Memory — VMSA, caches, ordering, atomics
    • 04 Vectors — NEON → SVE → SVE2 → SME
    • 05 Security — TrustZone, EL2, PAC/BTI/MTE, CCA
    • 06 Microarchitecture — OoO, DynamIQ, PMU
  • Together they cover what a phone/SoC/kernel Arm engineer is expected to know — and what a Neoverse engineer needs before deck 01 of that series.

What I'd study next

  • AMBA series — everything below the core (AXI, ACE, CHI)
  • Neoverse series — same cores, server deployment
  • Arm System IP series — GIC, SMMU, DSU, MPAM
  • Modern SoC Design series — packaging, chiplets, SerDes, CXL
15

Lessons

  • "Why did A76 shorten the pipeline vs A72?" → higher IPC + lower mispredict cost + 7 nm process allowed physical designs with fewer stages at same clock.
  • "Explain DynamIQ vs big.LITTLE" → one cluster, many core types, shared L3, per-core DVFS; vs two fixed clusters, separate L2s, CCI-400 inter-cluster coherence.
  • "Why is RSB important?" → returns are indirect but deterministic — a small RSB predicts them perfectly. Attackers have exploited RSB underflows for Spectre-RSB variants.
  • "What's the difference between PMU and SPE?" → PMU counts events in aggregate; SPE samples individual operations with per-op latency & attribution.
  • "Why 1+3+4?" → thermal + benchmark + efficiency blend. 1 X for burst benchmarks, 3 mid for sustained, 4 little for background — fits a 5-7 W budget.
  • "Why does Arm keep making L3 bigger?" → DRAM bandwidth per core didn't grow at Moore rate. Larger L3 hides more misses. X925 + DSU-120 pushes toward laptop-sized LLCs.
  • "DSU vs CMN?" → DSU is a cluster-level shared L3 + snoop filter; CMN is a system-level mesh that connects multiple DSUs + I/O + memory controllers.
16

Closing: What the Next 5 Years Look Like

  • Bigger front-ends — 10-wide decode likely permanent; Apple-class 12-16-wide within reach.
  • SVE2 as default vectors — NEON in compat mode only.
  • SME on flagships — matrix units as standard for on-device LLM.
  • Memory tagging universal — MTE async in every Armv9-A shipping kernel.
  • Laptop-class Cortex-A — Windows-on-Arm + Android XR + Chromebook share the same X-core die budgets.
  • CCA on phones — protected media + on-device AI inference in Realm world.

The bar for an Arm hire

A serious Arm microarchitecture review covers: branch prediction, OoO recovery, memory ordering, cache coherence, PMU analysis, power/thermal modelling, and at least awareness of PAC/BTI/MTE/CCA. The six decks in this series are the map.

17

References

Arm Ltd. — per-core Technical Reference Manuals (A76, A78, X1, X3, X4, A520, A725, X925) — publicly available
Arm Ltd. — DynamIQ Shared Unit TRM (DSU-110, DSU-120) — cache hierarchy, coherence, CHI egress
Arm Ltd. — Arm PMU Architecture Reference Manual (DDI 0601) — PMU + SPE + AMU
Shen & Lipasti — Modern Processor Design (McGraw-Hill, 2013) — OoO pipeline fundamentals
Seznec, A. — "A case for (partially) TAGged geometric history length branch prediction" (ISCA 2006)
Chipsandcheese.com — microarchitecture deep-dives on Neoverse V2, Cortex-X4, Apple M-series
AnandTech — Andrei Frumusanu archive (2015-2023) — per-release benchmark + IPC analysis
Hot Chips / ISSCC proceedings — annual Cortex-X and Neoverse architecture talks
Linux kernel — arch/arm64/ — canonical reader of every A-profile PMU / cache / memory feature

Presentation built with Reveal.js 4.6 · Playfair Display + DM Sans + JetBrains Mono
Educational use.