ARM CORTEX-A · PRESENTATION 06

Microarchitecture — OoO, DynamIQ, PMU

From A8's in-order pipeline to X925's 10-wide OoO beast
Fetch · Decode · Rename · ROB · Issue · L/S · Branch Prediction · DynamIQ DSU · PMU · SPE · TRBE
02

From In-Order to Out-of-Order

Core         Decode       Issue            Stages  Comment
Cortex-A8    2            2 in-order       13      Baseline, no OoO
Cortex-A9    2            4 OoO (partial)  8-11    First Arm with register rename
Cortex-A15   3            8 OoO            15      First "big" OoO
Cortex-A57   3            8 OoO            15-17   First AArch64 big core
Cortex-A72   3            8 OoO            15      Refined A57, lower power
Cortex-A76   4            8 OoO            13      Shorter pipeline, higher IPC
Cortex-A78   4            8 OoO            13      Power-tuned A77
Cortex-X1    5            8 OoO            13      First "Custom Core"; wider front-end
Cortex-X4    10 dispatch  10+ OoO          ~15     Largest ROB to date (~384 entries)
Cortex-X925  10           10+ OoO          15+     2024 "Blackhawk" — flagship width

Two trends over 20 years: wider front-ends (2 → 10 decode) and shorter pipelines (A15's 15 stages → A76's 13). Width raises IPC; a shorter pipe cuts the mispredict penalty, which helps branchy code.

03

Canonical OoO Pipeline (Cortex-A78 / X1 class)

[Schematic] Cortex-A76/A78 pipeline:
Fetch (4-wide) + predict (BTB/RSB/BHT) → Decode (4-wide) → Rename/Dispatch (~200-entry ROB) → issue queue per unit → 2× ALU + MAC + DIV · 2× FP/SIMD (128-bit) · 2× Load/Store · Branch unit → Commit (in-order)
L1-I 64 KB · BTB ~6000 entries · RSB 16 · Tournament + TAGE-SC · L1-D 64 KB · L2 512 KB-1 MB private

The "big" Cortex-A pattern: 4-wide fetch/decode/rename, 8-wide issue across heterogeneous EUs, 2-3 load + store pipes, in-order commit. X-class widens this to 10-wide decode + 10+ issue.

04

Branch Prediction — The Other Half of IPC

  • BTB (Branch Target Buffer) — direct branches. Modern cores: 2-level, 2k-4k entries L0 + 16k+ entries L1.
  • Indirect predictor — for BR Xn. Smaller (hundreds of entries) but often path-hashed for vtable dispatch.
  • RSB (Return Stack Buffer) — predicts return targets. 16 entries typical. Overflows to BTB.
  • Conditional predictor — evolved from 2-bit counters → tournament → TAGE / TAGE-SC-L. State-of-the-art as of Cortex-X4/X925.
  • Perceptron predictor — a weighted sum of history bits; used in some non-Arm designs (AMD's Zen family) and rumoured in Apple's cores.
  • Flow: fetch block ⇒ predict targets + directions ⇒ redirect fetch same cycle. The latency to recover from a wrong prediction is the mispredict penalty.

The mispredict cost

On a 13-stage pipe with 4-wide fetch, a mispredict costs ~13 bubbles × 4 = 52 lost slots. A 95% accurate predictor on a branchy workload means 5% × 52 = 2.6 lost slots per branch — so predictor quality is often the #1 lever for big-core IPC.
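
A minimal sketch of that arithmetic (the helper name is ours; the numbers are the example above):

// Lost issue slots per executed branch.
// lost_slots_per_branch(13, 4, 0.95) == 2.6
static double lost_slots_per_branch(int pipe_depth, int fetch_width,
                                    double accuracy) {
    return (1.0 - accuracy) * pipe_depth * fetch_width;
}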

TAGE in 60s

Multiple "tagged geometric" history-length tables index by a hash of (PC, long branch history). Longest table that hits wins. Short-history & bi-modal fall-back tables cover easier branches. Excellent at irregular patterns.

05

big.LITTLE Origins (2011-2017)

  • 2011: Cortex-A15 + A7 pairing — first big.LITTLE. A15 was a power-hog OoO, A7 a tiny in-order.
  • Coherency glue: CCI-400 (ACE-based Cache Coherent Interconnect) kept the two clusters' L2 caches coherent when threads migrated.
  • Three operation modes:
    • Cluster Migration (CM) — all cores in one cluster at a time
    • CPU Migration (IKS) — pair each A15 with an A7, only one runs
    • Global Task Scheduling (GTS) — Linux sees all cores; scheduler migrates freely
  • Problems: migration latency (>100 µs), OS scheduler bugs, two fixed clusters ⇒ rigid SKUs. iOS never adopted big.LITTLE — Apple went straight to a single-cluster heterogeneous design.
big + LITTLE pair  Years
A15 + A7           2011-14
A57 + A53          2014-16
A72 + A53          2015-16
A73 + A53          2016-17

Why this had to end

Phones wanted 3 or 4 performance tiers (ultra-big / big / mid / little), not 2. Two fixed clusters don't scale. Hence: DynamIQ.

06

DynamIQ Shared Unit (DSU) — 2017 onwards

  • Launched with Cortex-A75 / A55 in 2017. Replaces big.LITTLE's two-cluster model.
  • One cluster containing up to 8 (DSU-110) or 14 (DSU-120) cores of mixed type.
  • Integrates:
    • Snoop filter + SCU (coherence)
    • L3 cache — shared across the cluster, up to 16 MB (DSU-110) / 32 MB (DSU-120)
    • Asynchronous bridges to the SoC fabric (CMN mesh, NIC, GIC)
    • Per-core DVFS domains — each core has its own clock + voltage
  • Typical 2024 flagship cluster: 1 × Cortex-X925 + 3 × A725 + 4 × A520 on a DSU-120.
  • Latency between any two cores' L1 is one DSU snoop — dramatically lower than CCI-400 inter-cluster latency.
DSU version  Year  Max cores  Max L3  Shipping in
DSU (v0)     2017  8          4 MB    Kirin 970, Snapdragon 845
DSU-110      2021  8          16 MB   Snapdragon 8 Gen 1/2
DSU-120      2023  14         32 MB   Dimensity 9300, Cortex-X4 platform
DSU-120AE    2024  14         32 MB   Automotive — ASIL-B/D

CHI egress

DSU speaks AMBA 5 CHI out to the CMN mesh / NIC-700. That's the clean interface to the rest of the SoC.

07

Thermal & Power — Why 1+3+4

  • Modern phone SoCs have ~5-7 W sustained power budgets. Bursts to 12 W cause thermal throttling within ~30 s.
  • Benchmarks (Geekbench, Antutu) run single-thread peaks → X-cores are for < 1-second bursts.
  • Day-to-day browsing & app scrolling uses 1-3 mid cores — sustained efficiency zone.
  • Background work (notifications, network, music) runs on little cores at a fraction of the energy per unit of work.
  • The mapping happens through Android's EAS (Energy Aware Scheduling), which models each core's capacity and energy curve — a toy sketch follows this list.
  • Thermal headroom is where the next-generation fights are happening. Dimensity 9300 dropped little cores entirely (4 × X4 + 4 × A720) to maximise peak, and drew criticism over sustained battery life.
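
A toy version of the EAS placement idea. The capacity/power numbers are invented for illustration, and the kernel's real energy model integrates per-OPP tables rather than a single operating point:

typedef struct { const char *name; double capacity; double watts; } core_t;

static const core_t cores[] = {
    { "big (X925)",    1024.0, 4.0 },   // invented numbers, not real tables
    { "mid (A725)",     640.0, 1.2 },
    { "little (A520)",  256.0, 0.3 },
};

// Pick the lowest energy-per-work core that still fits the task.
int pick_core(double util /* task utilisation, in capacity units */) {
    int best = -1;
    double best_epw = 1e30;
    for (int i = 0; i < 3; i++) {
        if (cores[i].capacity < util) continue;     // core too small
        double epw = cores[i].watts / cores[i].capacity;
        if (epw < best_epw) { best_epw = epw; best = i; }
    }
    return best;   // -1 ⇒ nothing fits; real EAS falls back to load balancing
}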

DVFS = dynamic voltage / frequency scaling

Per-core DVFS lets the SoC run one X at 3.8 GHz while four A520s sit at 1.0 GHz. Huge energy win versus the old "all cores same freq" models.

AMU — Activity Monitors (v8.4)

Architectural counters the OS can read to estimate perf/watt without proprietary telemetry. Feeds Linux uclamp and EAS decisions.

08

PMU — Performance Monitoring

  • Armv8-A PMU: 1 cycle counter (PMCCNTR_EL0) + programmable event counters (PMEVCNTRn_EL0) per PE — typically 6, architecturally up to 31.
  • Events selected via PMEVTYPERn_EL0 — hundreds of events (cache misses, BP mispredicts, stalls, retired instructions).
  • Common events:
    • CPU_CYCLES, INST_RETIRED
    • L1D_CACHE / L1D_CACHE_REFILL
    • BR_MIS_PRED_RETIRED
    • STALL_FRONTEND / STALL_BACKEND
    • Topdown Armv8.8-A events (BAD_SPEC etc.) for Intel-style topdown analysis
  • Linux perf and Android simpleperf both consume PMU directly.
// Program PMU counter 0 = L1 D-cache miss
// (assumes PMCR_EL0.E is set and EL0 access is enabled via PMUSERENR_EL0)
msr   pmselr_el0, xzr            // select counter 0
mov   x0, #0x03                  // event 0x03 = L1D_CACHE_REFILL
msr   pmxevtyper_el0, x0         // MSR needs a register, not an immediate
mov   x1, #1
msr   pmcntenset_el0, x1         // enable counter 0

// Run workload...

msr   pmcntenclr_el0, x1         // disable counter 0
mrs   x0, pmxevcntr_el0          // read count

// Linux equivalent
// $ perf stat -e l1d_cache_refill ./bench
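
The same count from user space, as a minimal sketch around the Linux perf_event_open syscall (raw event 0x03 = L1D_CACHE_REFILL; error handling trimmed):

#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_RAW;
    attr.size = sizeof(attr);
    attr.config = 0x03;                  // raw event: L1D_CACHE_REFILL
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    int fd = (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    // ... run workload here ...
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t count;
    read(fd, &count, sizeof(count));
    printf("L1D refills: %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}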

For virtualization, MDCR_EL2.HPMN partitions the counters so a guest can be given its own PMU view.

09

SPE — Statistical Profiling Extension (v8.2)

  • PMU counts events in aggregate. SPE samples individual executed operations — like Intel PEBS, AMD IBS.
  • Every Nth retired µop writes a sample record to a dedicated memory buffer:
    • Instruction PC + virtual address of any memory reference
    • Latency of that operation
    • Outcome (hit / miss / TLB miss / fault / mispredict)
  • Enables:
    • Accurate per-instruction attribution of cache misses, mispredicts, TLB misses
    • Cycle counting histograms per source line
    • Data-reference profiling for memory layout optimisation
  • Linux perf record -e arm_spe_0// harvests the buffer — usage sketch below.
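
Typical usage, assuming kernel + hardware support (the PMU instance name arm_spe_0 and the format flags ts_enable / load_filter / min_latency come from the upstream driver; support varies by kernel):

// Sample loads with latency ≥ 50 cycles, timestamps on
// $ perf record -e arm_spe_0/ts_enable=1,load_filter=1,min_latency=50/ -- ./bench
// $ perf report --stdio     // per-op PC, data address, latency, hit/miss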

Why this matters in HPC/ML

Knowing "this kernel has 5% L3 miss rate" is nearly useless. Knowing "x[i][j] at matmul.c:47 causes 90% of L3 misses and costs 200 cycles each" is actionable. SPE closes that gap on Arm.

ETE / TRBE (v8.4 / v9)

ETE is the successor to ETM as the trace source; TRBE lands the program trace in a buffer in DRAM. Real-time, non-invasive. Used for post-mortem kernel crash analysis.

10

L1 / L2 / L3 Cache Sizes Over Generations

Core         L1-I / L1-D    L2                          L3 (DSU)
A8           32 KB / 32 KB  0-1 MB shared               -
A15          32 KB / 32 KB  cluster-shared, up to 4 MB  -
A57          48 KB / 32 KB  cluster-shared 1-2 MB       -
A72          48 KB / 32 KB  cluster-shared 1-2 MB       -
A76          64 KB / 64 KB  256-512 KB private          up to 4 MB
A78          64 KB / 64 KB  256-512 KB private          up to 8 MB
X1           64 KB / 64 KB  up to 1 MB private          up to 8 MB
X4 / A720    64 KB / 64 KB  2 MB / 512 KB-1 MB          up to 32 MB (DSU-120)
X925 / A725  64 KB / 64 KB  3 MB / 1 MB                 up to 32 MB

Two big shifts: (1) 2018 — L2 became private, matched with introduction of the shared L3 in DSU. (2) 2023 — L2 exploded to 2-3 MB per X-core, approaching laptop-class sizes.

11

Load/Store Unit — the Quiet Bottleneck

  • Modern big Cortex-A: 2 load + 2 store pipes (X1+ has 3 load).
  • Store buffer: 40-60 entries. Coalesces consecutive stores to same line before committing to L1.
  • Memory dependency prediction — speculates whether a load depends on an earlier store in the queue. Mispredict → replay the load.
  • DSB stalls — a real DSB drains the store buffer and blocks the pipeline for tens of cycles. One reason release/acquire ordering is preferred over full barriers.
  • Atomic fast-path: LSE atomics (CAS, LDADD) can be sent past the store queue straight to the cache controller as "far atomic" requests — see the C snippet below.
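
From C, with LSE available (-march=armv8.1-a, or selected at runtime via -moutline-atomics), the fetch-add below compiles to a single LDADD rather than an LDXR/STXR retry loop:

#include <stdatomic.h>

// One LDADD instruction on LSE hardware; no load-exclusive retry loop.
long counter_add(_Atomic long *ctr, long v) {
    return atomic_fetch_add_explicit(ctr, v, memory_order_relaxed);
}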

DC ZVA — the allocate-without-read trick

Allocates a cache line in L1 without an initial fill from DRAM, then zeroes the whole line. Used in memset(0), calloc, page zeroing. Often 3-5× faster than a loop of STR XZR stores.
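
A minimal sketch, assuming a 64-byte zero granule and a line-aligned buffer — production code must read the real block size from DCZID_EL0 and check that ZVA is not prohibited:

#include <stddef.h>
#include <stdint.h>

// Zero `len` bytes starting at `buf` (64-byte aligned, len % 64 == 0).
void zero_with_dc_zva(void *buf, size_t len) {
    for (uintptr_t p = (uintptr_t)buf; p < (uintptr_t)buf + len; p += 64)
        __asm__ volatile("dc zva, %0" : : "r"(p) : "memory");
}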

STP pair-ops

STP Xn, Xm, [sp, #-16]! is a single instruction that writes 16 bytes — double the store width. That's why every function prologue/epilogue on AArch64 uses STP/LDP:
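
// The canonical compiler-emitted frame save/restore
stp   x29, x30, [sp, #-16]!   // push frame pointer + link register in one store
mov   x29, sp                 // establish the new frame
// ... function body ...
ldp   x29, x30, [sp], #16     // restore both with one load
ret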

12

What the Compiler Sees — ACLE & Attributes

  • Tuning switches — compiler picks instruction scheduling & feature subsets per core:
    • -mcpu=cortex-x4
    • -mcpu=neoverse-v2
    • -march=armv9-a+sve2+bf16
  • Function-level hints:
    • __attribute__((target("+sve"))) — per-function code-gen target
    • __builtin_prefetch(p, rw, locality) → PRFM
    • MTE intrinsics from ACLE's arm_acle.h (e.g. __arm_mte_create_random_tag)
  • Pragma-level: #pragma GCC ivdep, #pragma clang loop vectorize(enable).
// Compiler-targeted optimisation

// Force SVE code-gen for this function (clang/GCC on AArch64)
__attribute__((target("+sve")))
float sve_dot(const float *a, const float *b, size_t n) {
  float acc = 0.0f;
  for (size_t i = 0; i < n; i++)  // compiler can emit a VLA SVE loop here
    acc += a[i] * b[i];           // (FP reduction needs -O3 -ffast-math)
  return acc;
}

// Prefetch hint
for (size_t i = 0; i < n; i += 64) {
  __builtin_prefetch(&arr[i + 128], 0, 0); // read, streaming → PRFM pldl1strm
  process(&arr[i]);
}

// Runtime dispatch (outline-atomics style): pick a kernel once at startup
typedef float (*dot_fn)(const float *, const float *, size_t);
dot_fn impl = runtime_has_sve() ? sve_kernel : neon_kernel;
13

Neoverse Heritage — Cortex-A → Server

  • Every Neoverse core is a re-characterised Cortex-A:
    • Neoverse N1 = Cortex-A76 (2018)
    • Neoverse V1 = Cortex-X1 + 2×256-bit SVE (2021)
    • Neoverse N2 = Cortex-A710 (v9-A)
    • Neoverse V2 = Cortex-X3 + 4×128-bit SVE (2022)
    • Neoverse V3 / N3 (2024) — shares front-end DNA with X4 / A720
  • Server-side changes:
    • Dedicated CMN mesh (not DSU)
    • RAS, MPAM, SBSA compliance
    • More L2 + SLC, wider memory I/F (DDR5 / HBM)
  • See the Neoverse Presentation Series for the server story.

Why a common core

Sharing microarchitecture between Cortex-A and Neoverse amortises R&D across the mobile + server markets — partners ship > 100 M Arm-based phones per quarter and > 10 M server cores per year off the same design dollars.

Custom licensees

Apple, Qualcomm (Nuvia-derived Oryon), and Ampere (AmpereOne) run the Arm ISA but implement their own microarchitecture — none of them use Cortex-A RTL. AWS Graviton 4, by contrast, is built on Neoverse V2 cores.

14

Series Finale — Putting It All Together

  • Across the 6 decks:
    • 01 History — why the family looks the way it does
    • 02 Architecture — ELs, A64, system regs
    • 03 Memory — VMSA, caches, ordering, atomics
    • 04 Vectors — NEON → SVE → SVE2 → SME
    • 05 Security — TrustZone, EL2, PAC/BTI/MTE, CCA
    • 06 Microarchitecture — OoO, DynamIQ, PMU
  • Together they cover what a phone/SoC/kernel Arm engineer is expected to know — and what a Neoverse engineer needs before deck 01 of that series.

What I'd study next

  • AMBA series — everything below the core (AXI, ACE, CHI)
  • Neoverse series — same cores, server deployment
  • Arm System IP series — GIC, SMMU, DSU, MPAM
  • Modern SoC Design series — packaging, chiplets, SerDes, CXL
15

Lessons

  • "Why did A76 shorten the pipeline vs A72?" → higher IPC + lower mispredict cost + 7 nm process allowed physical designs with fewer stages at same clock.
  • "Explain DynamIQ vs big.LITTLE" → one cluster, many core types, shared L3, per-core DVFS; vs two fixed clusters, separate L2s, CCI-400 inter-cluster coherence.
  • "Why is RSB important?" → returns are indirect but deterministic — a small RSB predicts them perfectly. Attackers have exploited RSB underflows for Spectre-RSB variants.
  • "What's the difference between PMU and SPE?" → PMU counts events in aggregate; SPE samples individual operations with per-op latency & attribution.
  • "Why 1+3+4?" → thermal + benchmark + efficiency blend. 1 X for burst benchmarks, 3 mid for sustained, 4 little for background — fits a 5-7 W budget.
  • "Why does Arm keep making L3 bigger?" → DRAM bandwidth per core didn't grow at Moore rate. Larger L3 hides more misses. X925 + DSU-120 pushes toward laptop-sized LLCs.
  • "DSU vs CMN?" → DSU is a cluster-level shared L3 + snoop filter; CMN is a system-level mesh that connects multiple DSUs + I/O + memory controllers.
16

Closing: What the Next 5 Years Look Like

  • Bigger front-ends — 10-wide decode likely permanent; Apple-class 12-16-wide within reach.
  • SVE2 as default vectors — NEON in compat mode only.
  • SME on flagships — matrix units as standard for on-device LLM.
  • Memory tagging universal — MTE async in every Armv9-A shipping kernel.
  • Laptop-class Cortex-A — Windows-on-Arm + Android XR + Chromebook share the same X-core die budgets.
  • CCA on phones — protected media + on-device AI inference in Realm world.

The bar for an Arm hire

A serious Arm microarchitecture review covers: branch prediction, OoO recovery, memory ordering, cache coherence, PMU analysis, power/thermal modelling, and at least awareness of PAC/BTI/MTE/CCA. The six decks in this series are the map.

17

References

Arm Ltd. — per-core Technical Reference Manuals (A76, A78, X1, X3, X4, A520, A725, X925) — publicly available
Arm Ltd. — DynamIQ Shared Unit TRM (DSU-110, DSU-120) — cache hierarchy, coherence, CHI egress
Arm Ltd. — Arm PMU Architecture Reference Manual (DDI 0601) — PMU + SPE + AMU
Shen & Lipasti — Modern Processor Design (McGraw-Hill, 2013) — OoO pipeline fundamentals
Seznec, A. — "A case for (partially) TAGged geometric history length branch prediction" (ISCA 2006)
Chipsandcheese.com — microarchitecture deep-dives on Neoverse V2, Cortex-X4, Apple M-series
AnandTech — Andrei Frumusanu archive (2015-2023) — per-release benchmark + IPC analysis
Hot Chips / ISSCC proceedings — annual Cortex-X and Neoverse architecture talks
Linux kernel — arch/arm64/ — canonical reader of every A-profile PMU / cache / memory feature

Presentation built with Reveal.js 4.6 · Playfair Display + DM Sans + JetBrains Mono
Educational use.