ARM NEOVERSE · PRESENTATION 02

Microarchitecture & Core Details

What Neoverse cores add on top of Cortex-X/-A · N vs V vs E
N1 · V1 · N2 · V2 · N3 · V3 · SMT2 · RAS · MPAM · ROBs · Prefetch · LSE on mesh
02

What Changes When Cortex Goes Server

  • RAS everywhere — ECC on L1/L2/L3, poison handling, containable aborts, SDEI for fatal-error signalling.
  • Larger TLBs + TLB hierarchy — servers touch much larger working sets; N2 has ~2-3× the L2 TLB of Cortex-A710.
  • Server-tuned prefetchers — favour large strides, DRAM-bandwidth-aware throttling.
  • Direct CHI egress — no mobile DSU cluster: each core sits on a CMN mesh tile directly.
  • MPAM v1 — allocate cache ways + memory bandwidth per partition ID. Mandatory for multi-tenant cloud.
  • Clock/voltage optimised for 2.5-3.3 GHz sustained — not 3.5+ GHz burst. Better perf/W at scale.

Why not just use Cortex-A?

A phone Cortex-X runs at 3.5 GHz for 200 ms and thermally throttles. A server Neoverse runs at 3.0 GHz continuously for 5 years. That requires very different silicon validation, leakage budgets, and RAS coverage — worth a separate product line.

Area ≠ wasted

Extra TLB + RAS + MPAM makes Neoverse ~15-20% larger than its Cortex-X cousin. At datacentre scale, that area is recovered 10× by better availability.

03

Neoverse N1 — Under the Hood

Parameter             Value
Architecture          Armv8.2-A (no SVE)
Decode / Dispatch     4-wide / 4-wide
Issue width           8 (2 ALU + MAC, 2 load/store, 2 FP/SIMD, 1 branch, 1 DIV)
Pipeline depth        ~11 stages (front end) + variable back end
ROB / renamed regs    ~128 / ~200 int, ~150 FP
L1-I / L1-D           64 KB / 64 KB, 4-way
L2 (private)          256 KB - 1 MB, 8-way
L3 / SLC              via CMN-600, 1-2 MB/core
SIMD                  NEON, 2 × 128-bit pipes
SMT                   — (single-thread only)
Target clock          2.5 - 3.0 GHz
Process               7 nm TSMC (Graviton 2)
04

Neoverse V1 — the Wide / SVE Core

  • Based on Cortex-X1. 5-wide decode, 8-wide issue, deeper OoO.
  • SVE: 2 × 256-bit FP/int pipes — twice the per-pipe width of N1's 2 × 128-bit NEON, same total width as V2's 4 × 128-bit. Best for dense FP kernels.
  • No SMT — one hardware thread per core; within the Neoverse line only E1 implements SMT2.
  • Larger ROB (~320 entries), richer prefetcher.
  • Private L2 up to 2 MB, 8-way.
  • bf16 / INT8 dot-product / matmul — v8.6-A extensions targeting ML.
  • First silicon: AWS Graviton 3 (Nov 2021) — 64 cores, 7-chiplet design, DDR5-4800, PCIe Gen 5.

SVE 256-bit vs 4×128

Same total FLOPs, but 2 × 256-bit is better for 256-bit-wide loads / stores and HPC kernels that naturally fit 256-bit. 4 × 128-bit gives more dispatch flexibility for cloud DB workloads. V1 chose wide-and-few; V2 chose narrow-and-many.
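What vector-length-agnostic SVE code looks like in practice — a minimal sketch using the standard ACLE intrinsics from arm_sve.h (the saxpy routine itself is illustrative, not from any shipped library). The same binary processes 256-bit vectors per iteration on V1 and 128-bit vectors on N2; nothing in the source hard-codes a width:

// Vector-length-agnostic SAXPY: y[i] += a * x[i]
// Build: gcc -O2 -march=armv8.2-a+sve saxpy.c -c
#include <arm_sve.h>
#include <stdint.h>

void saxpy(float a, const float *x, float *y, int64_t n) {
    for (int64_t i = 0; i < n; i += svcntw()) {     // svcntw() = f32 lanes per vector, resolved at run time
        svbool_t pg = svwhilelt_b32(i, n);          // predicate masks off the tail
        svfloat32_t vx = svld1_f32(pg, &x[i]);      // masked load of x
        svfloat32_t vy = svld1_f32(pg, &y[i]);      // masked load of y
        vy = svmla_n_f32_x(pg, vy, vx, a);          // vy += vx * a
        svst1_f32(pg, &y[i], vy);                   // masked store
    }
}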

Graviton 3's chiplet trick

7 dies in one package: a 64-core compute die plus memory-controller and PCIe chiplets, connected on an Amazon-designed organic package. Similar idea to AMD's Epyc. Cost advantage: yield is better on several smaller dies than one huge die.

05

Neoverse N2 — the Armv9 Workhorse

  • Based on Cortex-A710 (mobile flagship, v9-A). 5-wide decode, 8-wide issue, 2 × 128-bit SVE2.
  • Mandatory SVE2 — first Neoverse with vector-length-agnostic ops.
  • Armv9-A features: RME / CCA (confidential compute), BTI, MTE — user-space probe sketch after the table below.
  • Private L2 up to 1 MB. Direct CHI mesh egress.
  • MPAM v1, SPE (statistical profiling), ETE/TRBE (embedded trace).
  • Sweet spot: 64-128 cores per socket at 2.7-3.2 GHz, DDR5 / LPDDR5X, CXL 2.0.
  • Canonical silicon: Microsoft Cobalt 100 (128 cores, Azure), Alibaba Yitian 710 (128 cores, 2021 early adopter). (AWS skipped N2: Graviton 4 uses 96 × V2 instead.)
Feature              N1           N2
Arch                 v8.2-A       v9.0-A
SIMD                 NEON 2×128   SVE2 2×128
BTI / MTE            No           Yes
RME / CCA            No           Yes
MPAM                 —            v1 mandatory
Mesh                 CMN-600      CMN-700
Peak SPECint/core    ~20          ~30
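Those v9 features can be probed from user space through Linux ELF hwcaps. A minimal sketch — the HWCAP constants come from the kernel's asm/hwcap.h, and which bits exist depends on kernel version:

// Probe Armv9 features via ELF hwcaps
#include <stdio.h>
#include <sys/auxv.h>
#include <asm/hwcap.h>   // AArch64 HWCAP / HWCAP2 bit definitions

int main(void) {
    unsigned long hw  = getauxval(AT_HWCAP);
    unsigned long hw2 = getauxval(AT_HWCAP2);
    printf("SVE : %s\n", (hw  & HWCAP_SVE)   ? "yes" : "no");
    printf("SVE2: %s\n", (hw2 & HWCAP2_SVE2) ? "yes" : "no");
    printf("BTI : %s\n", (hw2 & HWCAP2_BTI)  ? "yes" : "no");
    printf("MTE : %s\n", (hw2 & HWCAP2_MTE)  ? "yes" : "no");
    return 0;
}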
06

Neoverse V2 — Grace Inside

  • Based on Cortex-X3. 6-wide decode. Biggest IPC jump in Neoverse history at the time (~35% SPECint over V1).
  • SIMD: 4 × 128-bit SVE2 (not 2 × 256 like V1) — better for dispatch-limited code, matches V1 in total FLOPs.
  • Still single-threaded, like V1 — SMT remains an E1-only feature in the Neoverse line.
  • Larger ROB (~384 entries), refined TAGE predictor.
  • Private L2 up to 2 MB, 8-way.
  • Canonical silicon: NVIDIA Grace CPU — 72 × V2 cores per die, 480 GB LPDDR5X on-package, NVLink-C2C to Hopper/Blackwell GPU at 900 GB/s coherent.

Grace + Hopper

The CPU-GPU "superchip" pairs a V2-based Grace CPU with a Hopper/Blackwell GPU on one board. NVLink-C2C lets the GPU see the CPU's LPDDR5X coherently — no PCIe hop. That opens the door to ML models that spill to host memory transparently.

HPC scaling

Grace in pure CPU mode (72 + 72 = 144 cores, 960 GB LPDDR5X) targets AI data pipelines and in-memory databases. Shipping in HPE Cray EX and Eviden BullSequana systems at several European HPC centres.

07

Neoverse N3 / V3 — 2024 Refresh

  • N3 — successor to N2. ~20% IPC gain, ~25% energy reduction.
    • A720-class core ported to the server line
    • Improved BF16/INT8 matmul path — "CPU AI inference" story
    • 2 × 128-bit SVE2 retained from N2
  • V3 — successor to V2. ~35% SPECint gain, bigger per-core resources.
    • X4-class front-end, ~384-entry ROB, wider dispatch
    • New "data-dependent prefetcher" — latency-hiding for pointer-chasing workloads (DB B-trees)
    • CCA hardware acceleration — near-zero overhead for Realms
  • Delivered as CSS N3 / CSS V3 — pre-integrated with the CMN S3 mesh + GIC-700 + MMU-700 (SMMUv3).

CPU AI inference

N3's bf16/INT8 matmul (BFMMLA, SMMLA) gives ~2-3× LLM-token/sec compared to N2 at similar power. The pitch: avoid a separate NPU for small-model inference, keep everything on the CPU.
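For a feel of that matmul path, a hypothetical snippet around the ACLE vmmlaq_s32 intrinsic (the SMMLA instruction); the function and tile names are illustrative. One instruction accumulates a 2×2 int32 block from two 2×8 int8 tiles:

// SMMLA from C: acc(2x2, int32) += A(2x8, int8) · B^T(2x8, int8)
// Build: gcc -O2 -march=armv8.6-a+i8mm mm.c -c
#include <arm_neon.h>
#include <stdint.h>

int32x4_t mm_2x2_block(int32x4_t acc, const int8_t *a_2x8, const int8_t *bt_2x8) {
    int8x16_t va = vld1q_s8(a_2x8);    // two rows of A, 8 int8 each
    int8x16_t vb = vld1q_s8(bt_2x8);   // two rows of B-transposed (two columns of B)
    return vmmlaq_s32(acc, va, vb);    // SMMLA: one 2x2 int32 dot-product block
}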

V3's prefetcher

Picks up pointer-chase patterns (linked list, B-tree traversal). Not just stride-based. Arm reports ~15% Postgres and Memcached improvements from this feature alone.
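The pattern in question, as a tiny hypothetical microbenchmark (not Arm's workload): every load address depends on the previous load, so a stride prefetcher sees nothing, while a data-dependent prefetcher can run ahead through the next pointers:

// Pointer chase: fully serialised loads, one cache line per node
#include <stdlib.h>

struct node { struct node *next; long pad[7]; };   // 64 B = one cache line

long chase(struct node *head, long steps) {
    volatile struct node *p = head;
    for (long i = 0; i < steps; i++)
        p = p->next;            // each load waits on the previous miss
    return (long)p;             // defeat dead-code elimination
}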

08

SMT2 in Detail

  • Within Neoverse, E1 implements SMT2 — two hardware threads per physical core. The N- and V-series are single-threaded; the mechanics below apply to E1 and to SMT designs generally.
  • Each thread has its own:
    • Architectural register file
    • Rename map
    • Return stack buffer (RSB)
    • PMU counters
  • Shared:
    • Execution units
    • L1/L2 caches
    • Branch predictor tables
    • TLBs
  • Each thread presents as a separate PE to the OS, with its own MPIDR Aff0.

When SMT helps

Memory-bound, latency-tolerant workloads — DB joins, in-memory KV, nginx request handling. One thread stalls on L3 miss while the other keeps the EUs busy. Typical gains: 15-30% throughput at ~5% per-thread latency cost.

When SMT hurts

Branch-heavy single-thread workloads (compression, ray tracing) — two threads thrash branch-predictor tables. HPC codes that fit comfortably in L1/L2 see cache-thrashing losses. Often configured off for pure HPC.
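Whether SMT is exposed at all is visible in sysfs. A minimal sketch reading the standard Linux topology file (cpu0 chosen arbitrarily): on a single-threaded N/V part the file holds one CPU, on an SMT2 design such as E1 it holds a pair.

// Print cpu0's SMT siblings, e.g. "0" (no SMT) or "0,64" (SMT2 pair)
#include <stdio.h>

int main(void) {
    char buf[64];
    FILE *f = fopen("/sys/devices/system/cpu/cpu0/topology/thread_siblings_list", "r");
    if (f && fgets(buf, sizeof buf, f))
        printf("cpu0 thread siblings: %s", buf);
    if (f) fclose(f);
    return 0;
}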

09

RAS — Reliability, Availability, Serviceability

  • Servers run 24/7 for years. RAS features turn "crash" into "detected, contained, reported."
  • Arm RAS architecture (v8.2-A onward) defines a standard framework:
    • Error records per PE (ERXADDR_EL1, ERXSTATUS_EL1)
    • Severity classes: UC (uncontainable), UEU (unrecoverable), UER (recoverable), UEO (restartable / latent), DE (deferred), CE (corrected)
    • SError + SDEI signalling to EL3 / EL2 / EL1
    • Poison propagation: bad cache line carries a 1-bit "poison" until consumed
  • Neoverse cores implement ECC on:
    • L1 D-cache, L2, L3 tags + data
    • TLB entries
    • RF (parity on register file)
  • Memory scrubbing at CMN / memory controller level.

Poison — the real trick

When a cache line takes an uncorrectable error, the CPU doesn't kill the system. It marks the line as poisoned. The OS/firmware only needs to deal with it when (and if) a process actually reads from that exact line.

SDEI

Software Delegated Exception Interface — an EL3 firmware service that forwards RAS events to EL2/EL1 asynchronously. Lets Linux log + quarantine a faulty page without taking the whole kernel down.
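From user space, the end of that chain is a SIGBUS with si_code BUS_MCEERR_AR ("action required") on the consuming load. A minimal handler sketch; the policy here (log and exit) is illustrative:

// Catch consumption of a poisoned page on Linux
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static void on_sigbus(int sig, siginfo_t *si, void *ctx) {
    (void)sig; (void)ctx;
    if (si->si_code == BUS_MCEERR_AR)       // a poisoned line was actually read
        fprintf(stderr, "poison consumed at %p\n", si->si_addr);
    _exit(1);                               // the page is lost; do not retry the load
}

int main(void) {
    struct sigaction sa = {0};
    sa.sa_sigaction = on_sigbus;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGBUS, &sa, NULL);
    /* ... workload ... */
    return 0;
}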

10

MPAM — Memory Partitioning & Monitoring

  • Armv8.4-A optional, mandatory from Neoverse N2.
  • Each memory access is tagged with a PartID (partition ID) + PMG (monitoring group ID).
  • At each MPAM-aware point (L2, L3, CMN, memory controller), resources are partitioned by PartID:
    • Cache way/capacity allocation (way-based, like Intel CAT)
    • Bandwidth quotas (min / max BW per partition)
    • Monitoring counters for observed use
  • OS (Linux resctrl) or hypervisor assigns PartIDs per process / VM.
  • The enabler for multi-tenant cloud QoS — noisy-neighbour workloads can be capped.
# Linux MPAM (resctrl mount)
# Create a partition for tenant A
mount -t resctrl resctrl /sys/fs/resctrl

mkdir /sys/fs/resctrl/tenantA

echo "L3:0=00ff00;MB:0=50" \
     > /sys/fs/resctrl/tenantA/schemata

# L3 mask = 00ff00 → 8 ways allocated
# MB      = 50     → 50% of memory bw
# CPUs 4-7 belong to this partition
echo "4-7" > /sys/fs/resctrl/tenantA/cpus

MPAM was designed to match Intel CAT/CDP/MBA — same hyperscaler-friendly partition model, Arm-architectural instead of Intel-proprietary.

11

Neoverse Comparison Table

Param                 N1           V1            E1        N2           V2           N3          V3
Year                  2019         2021          2019      2021         2022         2024        2024
Arch                  v8.2         v8.4          v8.2      v9.0         v9.0         v9.2        v9.2
Decode                4            5             2         5            6            5           10
Issue                 8            8             3         8            8            8           10+
SIMD                  NEON 2×128   SVE 2×256     NEON      SVE2 2×128   SVE2 4×128   SVE2 2×128  SVE2 4×128
SMT                   —            —             2         —            —            —           —
Private L2            1 MB         2 MB          128 KB    1 MB         2 MB         1-2 MB      2-3 MB
Max cores / socket    80-128       64            —         64-128       72-144       192+        96
Mesh                  CMN-600      CMN-650/700   CMN-600   CMN-700     CMN-700      CMN S3      CMN S3

"E1 decode=2" reflects its in-order dual-issue design. Not shown: custom cores (Ampere AmpereOne A192 uses 192 in-house cores; Apple Silicon is outside the Neoverse family entirely).

12

Cache Hierarchy on a Neoverse Socket

[Diagram] Neoverse N2 / V2 socket: per-core L1-I 64 KB + L1-D 64 KB → private L2 (1-2 MB, ECC) → CMN-700 / CMN S3 mesh, one SLC slice + snoop filter per HN-F (128-512 MB SLC across the mesh) → 8-12 ch DDR5 / HBM3 / LPDDR5X.
  • L1 — 4-way, VIPT, 4-cycle hit.
  • L2 — 8-way, private, 8-12 cycle hit. ECC single-bit correct, double-bit detect.
  • Mesh / SLC — distributed across HN-F home nodes. Each hop ~3-5 cycles; an L2 miss that hits a remote SLC slice totals ~48-80 cycles depending on hop count.
  • DRAM — 8-12 ch DDR5-4800/5600 typical, up to 12 ch DDR5-5600 on Graviton 4 and Cobalt 100. LPDDR5X on Grace.
13

LSE Atomics on a Mesh

  • LDXR/STXR exclusive-monitor atomics scale poorly beyond ~16 cores — exclusive reservations ping-pong between cores, and livelock can ensue.
  • LSE atomics (CAS, LDADD, SWP) in Neoverse are implemented as near-atomics at the HN-F: the home node for that cache line executes the RMW on behalf of the requester, without transferring the line.
  • This scales to 128+ cores with near-zero tail latency — the hot counter / lock simply queues at the HN-F.
  • The -moutline-atomics compiler flag (default in GCC 10+, used by glibc 2.32+) emits binaries that pick LSE at run time.
  • On N1+, LSE improves mutex-heavy benchmarks (Memcached, Redis SET-heavy) by 20-40% over the LDXR/STXR equivalent; a C11 version follows the asm below.
// LSE atomic vs legacy LDXR/STXR

// Legacy — exclusives can livelock under contention
retry:
    ldxr   w1, [x0]          // load-exclusive
    add    w1, w1, #1
    stxr   w2, w1, [x0]      // fails if another core touched the line
    cbnz   w2, retry

// LSE (Neoverse) — single op, HN-F does it
    mov    w3, #1            // increment to apply
    ldadd  w3, w1, [x0]      // atomic fetch-add: old value returned in w1
// round-trip: 1 REQ → HN-F, 1 RSP back
// no need to own the line in Unique state

The CHI protocol at the HN-F has dedicated atomic RMW opcodes; the home node's cache controller executes them without the requester ever taking the line.
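In practice nobody writes the asm by hand: a plain C11 atomic lowers to the LSE form when the target allows it. A sketch, with the usual GCC/Clang flags shown in the comments:

// One shared counter, safe at 128+ cores
#include <stdatomic.h>

atomic_long hits;

void record_hit(void) {
    // -O2 -march=armv8.1-a                     → single ldadd (LSE)
    // -O2 -march=armv8-a -mno-outline-atomics  → ldxr/stxr retry loop
    atomic_fetch_add_explicit(&hits, 1, memory_order_relaxed);
}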

14

Clock, Voltage, Power Plans

  • Typical sustained clock: 2.5 - 3.3 GHz. Cortex-X can hit 3.5+ GHz on mobile thanks to bursty workloads; servers want sustained, thermally stable clocks.
  • Per-core DVFS domains (N1+). Idle cores drop to very low power.
  • Graviton 3 reports ~60 W TDP for 64 cores (~1 W/core at full load). A contemporary 32-core Xeon Gold 6430: ~270 W.
  • Power-state hints:
    • WFI / WFE / WFIT / WFET
    • PSCI CPU_SUSPEND
    • Per-core clock gating
    • L3 way power-down on idle
  • AMU (Activity Monitors, Armv8.4-A) gives the OS a hardware view of busy cycles and delivered frequency per core; Linux cpufreq uses it to act on actual rather than nominal frequency.

Server perf/W story

In 2024: Graviton 4 delivers ~1.3 × SPECrate perf/W of contemporary Intel Emerald Rapids on integer cloud workloads. That's Arm's central pitch to AWS, Microsoft, Google.

Density advantage

192 × N2 cores in a single socket at 250 W = 1.3 W/core. Same socket x86 tops out at ~128 cores at 400 W = ~3.1 W/core. That is why 1U Arm servers are replacing 2U x86 servers for stateless workloads.

15

Performance Counters — How to Actually Measure

  • Linux perf (perf stat, perf record -e …) works on Neoverse just as on x86; core events use Arm-architected event numbers.
  • Must-know events on Neoverse:
    • cpu_cycles, inst_retired
    • stall_backend_mem — memory-system stalls (CMN / LLC misses)
    • stall_frontend — I-cache or BP misses
    • br_mis_pred_retired
    • l2d_cache_refill / l3d_cache_refill
  • SPE — perf record -e arm_spe// records per-instruction samples with latency + attribution.
  • AmperePerf, NVIDIA Nsight Systems — vendor tooling layered on top of the PMU + SPE. A raw perf_event_open sketch follows.
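The counters are also reachable without the perf tool, via the raw perf_event_open(2) syscall. A minimal IPC sketch (error handling elided; the measured loop is a stand-in for real work):

// Measure IPC of a region with perf_event_open(2)
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdio.h>

static int open_counter(unsigned long long config) {
    struct perf_event_attr attr = {
        .type = PERF_TYPE_HARDWARE,
        .size = sizeof attr,
        .config = config,
        .disabled = 1,           // armed, started by ioctl below
        .exclude_kernel = 1,
    };
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void) {
    int cyc = open_counter(PERF_COUNT_HW_CPU_CYCLES);
    int ins = open_counter(PERF_COUNT_HW_INSTRUCTIONS);
    ioctl(cyc, PERF_EVENT_IOC_ENABLE, 0);
    ioctl(ins, PERF_EVENT_IOC_ENABLE, 0);

    volatile long acc = 0;                     // region of interest
    for (long i = 0; i < 10000000; i++) acc += i;

    long long c = 0, n = 0;
    read(cyc, &c, sizeof c);                   // one u64 per counter
    read(ins, &n, sizeof n);
    printf("IPC = %.2f\n", (double)n / (double)c);
    return 0;
}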

Topdown on Neoverse

Newer Arm PMU revisions add the slot-based stall events (STALL_SLOT_FRONTEND / STALL_SLOT_BACKEND) behind Intel-style "Topdown" analysis. Linux perf stat -M TopdownL1 breaks cycles into: Retiring / Bad Speculation / Frontend Bound / Backend Bound — the same model as Intel VTune.

Arm Performance Libraries

ArmPL, ACfL (Arm Compiler for Linux), NVIDIA HPC SDK all include FFT / BLAS tuned per Neoverse generation. Autotuning dispatches to N1 / V1 / N2 / V2 paths.

16

Lessons

  • "Why is V1 SVE 2×256 but V2 is 4×128?" → same total FLOPs, different dispatch shape. V1 favours long-vector HPC (BLAS); V2 favours short-vector dispatch density (cloud + DB).
  • "What makes Neoverse different from Cortex-A76?" → same core plus RAS, larger TLBs, MPAM, CHI mesh egress, server-tuned prefetchers, server clock/voltage curves.
  • "Why SMT on V but not N?" → HPC + DB workloads memory-stall; SMT hides that. Cloud scale-out is already parallel across cores; SMT just causes cache pressure.
  • "How do LSE atomics scale better than LDXR/STXR?" → RMW executed at the home node (HN-F) in CHI; no cache line transfer to the requester. Scales to 128+ cores.
  • "What is MPAM used for?" → multi-tenant cache + bandwidth QoS. Linux resctrl-style. Cloud operators stop noisy-neighbour VMs.
  • "What is poison in RAS?" → when an ECC error is uncorrectable, the cache line carries a 1-bit flag. Process only gets SError if/when it reads that exact line. Contained, not fatal.
  • "Graviton 2 vs 3 — main differences?" → N1 → V1 (SVE + SMT2), DDR4 → DDR5, PCIe 4 → 5, chiplet packaging, ~25-30% better SPECint.
17

References

Arm Ltd. — per-core Technical Reference Manuals (N1, V1, N2, V2, N3, V3)
Arm Ltd. — Neoverse Performance Analysis Methodology (DAI 0598)
Arm Ltd. — MPAM system architecture (DDI 0598, DDI 0601)
Arm Ltd. — RAS system architecture (Reliability, Availability, Serviceability)
AWS Annapurna — Graviton 2/3/4 Hot Chips talks
NVIDIA — Grace Hopper Superchip architecture whitepaper (2023)
Chipsandcheese.com — microarchitecture articles on Graviton 3, Neoverse V2
Linux kernel — arch/arm64/, Documentation/arch/arm64/perf.rst

Presentation built with Reveal.js 4.6 · Playfair Display + DM Sans + JetBrains Mono
Educational use.