RAS everywhere — ECC on L1/L2/L3, poison handling, containable aborts, SDEI for fatal-error signalling.
Larger TLBs + TLB hierarchy — servers touch much larger working sets; N2 has ~2-3× the L2 TLB of Cortex-A710.
Server-tuned prefetchers — favour large strides, DRAM-bandwidth-aware throttling.
CHI-B (N2+) egress — not DSU: each core sits on a CMN mesh tile directly.
MPAM v1 — allocate cache ways + memory bandwidth per partition ID. Mandatory for multi-tenant cloud.
Clock/voltage optimised for 2.5-3.3 GHz sustained — not 3.5+ GHz burst. Better perf/W at scale.
Why not just use Cortex-A?
A phone Cortex-X runs at 3.5 GHz for 200 ms and thermally throttles. A server Neoverse runs at 3.0 GHz continuously for 5 years. That requires very different silicon validation, leakage budgets, and RAS coverage — worth a separate product line.
Area ≠ wasted
Extra TLB + RAS + MPAM makes Neoverse ~15-20% larger than its Cortex-X cousin. At datacentre scale, that area is recovered 10× by better availability.
First silicon: AWS Graviton 3 (Nov 2021) — 64 cores, 7-chiplet design, DDR5-4800, PCIe Gen 5.
SVE 256-bit vs 4×128
Same total FLOPs, but 2 × 256-bit is better for 256-bit-wide loads / stores and HPC kernels that naturally fit 256-bit. 4 × 128-bit gives more dispatch flexibility for cloud DB workloads. V1 chose wide-and-few; V2 chose narrow-and-many.
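Because SVE code is vector-length-agnostic, the 2×256 vs 4×128 split is invisible to correctly written software: the same binary simply advances by whatever the hardware reports per iteration. A minimal sketch using the ACLE intrinsics (the function name is illustrative; assumes a compiler flag such as -O2 -march=armv8-a+sve):

#include <arm_sve.h>
#include <stddef.h>

// y[i] += a * x[i], written once, correct for any vector length:
// svcntd() returns the number of 64-bit lanes the hardware provides,
// so the loop advances 4 doubles per iteration on V1 (256-bit SVE)
// and 2 on V2 (128-bit SVE2) — only throughput differs, never results.
void daxpy_sve(double a, const double *x, double *y, size_t n)
{
    for (size_t i = 0; i < n; i += svcntd()) {
        svbool_t pg = svwhilelt_b64_u64(i, n);   // predicate covers the tail
        svfloat64_t vx = svld1_f64(pg, &x[i]);
        svfloat64_t vy = svld1_f64(pg, &y[i]);
        vy = svmla_n_f64_x(pg, vy, vx, a);       // vy += vx * a
        svst1_f64(pg, &y[i], vy);
    }
}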
Graviton 3's chiplet trick
7 dies on an Amazon-designed organic package: one 64-core compute die plus four DDR5-controller chiplets and two PCIe chiplets. Similar idea to AMD's Epyc — splitting memory and I/O off the compute die keeps the big die smaller and yields better than one huge monolithic die.
05
Neoverse N2 — the Armv9 Workhorse
Based on Cortex-A710 (mobile flagship, v9-A). 5-wide decode, 8-wide issue, 4 × 128-bit SVE2.
Mandatory SVE2 — the first Neoverse core with SVE2 (V1 introduced first-generation SVE); both are vector-length-agnostic.
Armv9-A features: RME / CCA (confidential compute), BTI, MTE.
Sweet spot: 64-128 cores per socket at 2.7-3.2 GHz, DDR5 / LPDDR5X, CXL 2.0.
Canonical silicon: Microsoft Cobalt 100 (128 cores, Azure) and Alibaba Yitian 710 (128 cores, 2021 early adopter); AWS Graviton 4 instead uses the V2 core.
Feature              N1            N2
Arch                 v8.2-A        v9.0-A
SIMD                 NEON 2×128    SVE2 4×128
BTI / MTE            No            Yes
RME / CCA            No            Yes
MPAM                 -             v1 mandatory
Mesh                 CMN-600       CMN-700
Peak SPECint/core    ~20           ~30
06
Neoverse V2 — Grace Inside
Based on Cortex-X3. 6-wide decode. Biggest IPC jump in Neoverse history at the time (~35% SPECint over V1).
SIMD: 4 × 128-bit SVE2 (not 2 × 256 like V1) — better for dispatch-limited code, matches V1 in total FLOPs.
SMT2 retained.
Larger ROB (~384 entries), refined TAGE predictor.
Private L2 up to 2 MB, 8-way.
Canonical silicon: NVIDIA Grace CPU — 72 × V2 cores per chiplet, up to 480 GB LPDDR5X on-package, NVLink-C2C to a Hopper/Blackwell GPU at 900 GB/s coherent. AWS Graviton 4 (96 × V2) is the other flagship V2 part.
Grace + Hopper
The CPU-GPU "superchip" pairs a V2-based Grace CPU with a Hopper or Blackwell GPU on one board. NVLink-C2C lets the GPU access the CPU's LPDDR5X coherently — no PCIe hop — which opens the door to ML models that spill to host memory transparently.
HPC scaling
Grace in pure CPU mode (72+72 = 144 cores, up to 960 GB LPDDR5X) targets storage nodes for AI training clusters and in-memory databases. Shipping in HPE Cray EX systems (Slingshot fabric), Eviden BullSequana, and several European HPC centres.
07
Neoverse N3 / V3 — 2024 Refresh
N3 — successor to N2. ~20% IPC gain, ~25% energy reduction.
Built on a Cortex-A720-class core
Improved BF16/INT8 matmul path — "CPU AI inference" story
4 × 128-bit SVE2 retained
V3 — successor to V2. ~35% SPECint gain, bigger per-core resources.
New "data-dependent prefetcher" — latency-hiding for pointer-chasing workloads (DB B-trees)
CCA hardware acceleration — near-zero overhead for Realms
Delivered as CSS N3 / CSS V3 — pre-integrated with CMN S3 mesh + GIC-700 + SMMU-700.
CPU AI inference
N3's BF16/INT8 matrix instructions (BFMMLA, SMMLA) give ~2-3× the LLM tokens/sec of N2 at similar power. The pitch: avoid a separate NPU for small-model inference and keep everything on the CPU.
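For a flavour of what the instructions do, here is a minimal sketch of the INT8 path using the ACLE intrinsic that maps to SMMLA (function name and tile framing are illustrative; assumes the I8MM extension is enabled, e.g. -march=armv8.2-a+i8mm):

#include <arm_neon.h>

// One SMMLA via the ACLE intrinsic vmmlaq_s32: multiplies a 2x8 tile of
// int8 values by an 8x2 tile and accumulates the 2x2 int32 result into
// acc — 32 multiply-accumulates per instruction, the building block of a
// quantised matmul kernel. BFMMLA has the analogous vbfmmlaq_f32 form for
// bfloat16 inputs with float32 accumulation.
int32x4_t mmla_tile(int32x4_t acc, int8x16_t a, int8x16_t b)
{
    return vmmlaq_s32(acc, a, b);
}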
V3's prefetcher
Picks up pointer-chase patterns (linked list, B-tree traversal). Not just stride-based. Arm reports ~15% Postgres and Memcached improvements from this feature alone.
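The access pattern it targets looks like the sketch below: each load address comes from the data returned by the previous load, so there is no stride for a conventional prefetcher to learn, while a data-dependent prefetcher can run ahead by dereferencing the next pointer itself (structure and function names are illustrative):

#include <stddef.h>

struct node {
    struct node *next;   // the address of the following node is only
                         // known once this field has been loaded
    long         key;
};

// Classic pointer chase (linked list / B-tree descent): every load is
// serialised behind the previous cache miss. A stride prefetcher sees no
// pattern; a data-dependent prefetcher can prefetch p->next early.
long sum_keys(const struct node *p)
{
    long sum = 0;
    for (; p != NULL; p = p->next)
        sum += p->key;
    return sum;
}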
08
SMT2 in Detail
V1 and V2 (and V3) implement SMT2 — two hardware threads per physical core.
Each thread has its own:
Architectural register file
Rename map
Return stack buffer (RSB)
PMU counters
Shared:
Execution units
L1/L2 caches
Branch predictor tables
TLBs
Each thread presents as a separate PE to the OS, with its own MPIDR Aff0.
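MPIDR_EL1 is only readable from EL1, so user space normally sees the topology through the kernel, but the field layout itself is simple. A sketch of decoding the affinity fields (the helper name is illustrative; field positions per the Arm ARM):

#include <stdint.h>

// MPIDR_EL1 affinity fields: Aff3[39:32], Aff2[23:16], Aff1[15:8], Aff0[7:0].
// On an SMT2 Neoverse core the two hardware threads share Aff1/Aff2/Aff3
// and differ only in Aff0 (0 or 1).
struct affinity { unsigned aff3, aff2, aff1, aff0; };

static struct affinity decode_mpidr(uint64_t mpidr)
{
    struct affinity a;
    a.aff3 = (mpidr >> 32) & 0xff;
    a.aff2 = (mpidr >> 16) & 0xff;
    a.aff1 = (mpidr >>  8) & 0xff;
    a.aff0 =  mpidr        & 0xff;   // thread number within the core
    return a;
}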
When SMT helps
Memory-bound, latency-tolerant workloads — DB joins, in-memory KV, nginx request handling. One thread stalls on L3 miss while the other keeps the EUs busy. Typical gains: 15-30% throughput at ~5% per-thread latency cost.
When SMT hurts
Branch-heavy single-thread workloads (compression, ray tracing) — two threads thrash branch-predictor tables. HPC codes that fit comfortably in L1/L2 see cache-thrashing losses. Often configured off for pure HPC.
09
RAS — Reliability, Availability, Serviceability
Servers run 24/7 for years. RAS features turn "crash" into "detected, contained, reported."
Arm RAS architecture (v8.2-A onward) defines a standard framework:
Poison propagation: bad cache line carries a 1-bit "poison" until consumed
Neoverse cores implement ECC on:
L1 D-cache, L2, L3 tags + data
TLB entries
Register file (parity protected)
Memory scrubbing at CMN / memory controller level.
Poison — the real trick
When a cache line takes an uncorrectable error, the CPU doesn't kill the system. It marks the line as poisoned. The OS/firmware only needs to deal with it when (and if) a process actually reads from that exact line.
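On Linux that consumption surfaces in user space as a SIGBUS with si_code BUS_MCEERR_AR (action required); a service that wants to survive can catch it and retire the affected page rather than die. A minimal sketch (recovery policy omitted; a real handler must stick to async-signal-safe calls):

#define _GNU_SOURCE
#include <signal.h>
#include <stdlib.h>
#include <unistd.h>

// SIGBUS handler for hardware memory errors. BUS_MCEERR_AR means the
// faulting instruction consumed poisoned data; si_addr holds the address.
static void mce_handler(int sig, siginfo_t *si, void *uctx)
{
    (void)sig; (void)uctx;
    if (si->si_code == BUS_MCEERR_AR || si->si_code == BUS_MCEERR_AO) {
        // A real service would record si->si_addr, unmap or re-fetch the
        // page, and fail only the affected request. Exiting is the
        // simplest form of containment.
        static const char msg[] = "poisoned memory consumed - exiting\n";
        write(STDERR_FILENO, msg, sizeof msg - 1);
        _exit(EXIT_FAILURE);
    }
    abort();   // any other SIGBUS is a plain bug
}

int main(void)
{
    struct sigaction sa = { .sa_sigaction = mce_handler,
                            .sa_flags     = SA_SIGINFO };
    sigaction(SIGBUS, &sa, NULL);
    /* ... workload ... */
    return 0;
}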
SDEI
Software Delegated Exception Interface — an EL3 firmware service that forwards RAS events to EL2/EL1 asynchronously. Lets Linux log + quarantine a faulty page without taking the whole kernel down.
10
MPAM — Memory Partitioning & Monitoring
An optional extension from Armv8.4-A; mandatory from Neoverse N2 onward.
Each memory access is tagged with a PartID (partition ID) + PMG (monitoring group ID).
At each MPAM-aware point (L2, L3, CMN, memory controller), resources are partitioned by PartID:
Cache way/capacity allocation (way masks, like Intel CAT)
Bandwidth quotas (min / max BW per partition)
Monitoring counters for observed use
OS (Linux resctrl) or hypervisor assigns PartIDs per process / VM.
The enabler for multi-tenant cloud QoS — noisy-neighbour workloads can be capped.
# Linux MPAM (resctrl mount)
# Create a partition for tenant A
mount -t resctrl resctrl /sys/fs/resctrl
mkdir /sys/fs/resctrl/tenantA
# Each resource gets its own schemata line:
# L3 mask 00ff00 → 8 ways allocated; MB 50 → 50% of memory bandwidth
echo "L3:0=00ff00" > /sys/fs/resctrl/tenantA/schemata
echo "MB:0=50"     > /sys/fs/resctrl/tenantA/schemata
# CPUs 4-7 belong to this partition
echo "4-7" > /sys/fs/resctrl/tenantA/cpus
MPAM was designed to match Intel CAT/CDP/MBA — same hyperscaler-friendly partition model, Arm-architectural instead of Intel-proprietary.
11
Neoverse Comparison Table
Param              N1        V1           E1       N2          V2          N3          V3
Year               2019      2021         2020     2021        2022        2024        2024
Arch               v8.2      v8.4         v8.2     v9.0        v9.0        v9.2        v9.2
Decode             4         5            2        5           6           5           10
Issue              8         8            3        8           8           8           10+
SIMD               NEON      2×256 SVE    NEON     4×128 SVE2  4×128 SVE2  4×128 SVE2  4×128 SVE2
SMT                -         2            2        -           2           -           2
Private L2         1 MB      2 MB         128 KB   1 MB        2 MB        1-2 MB      2-3 MB
Max cores/socket   80-128    64           -        128         72-144      192+        96
Mesh               CMN-600   CMN-650/700  CMN-600  CMN-700     CMN-700     CMN S3      CMN S3
"E1 decode=2" reflects its in-order dual-issue design. Not shown: custom cores (Ampere AmpereOne A192 uses 192 in-house cores; Apple Silicon is outside the Neoverse family entirely).
LSE atomics (CAS, LDADD, SWP) in Neoverse are implemented as far atomics at the HN-F: the home node for that cache line executes the RMW on behalf of the requester, without transferring the line.
This scales to 128+ cores with near-zero tail latency — the hot counter / lock simply queues at the HN-F.
The -moutline-atomics compiler flag (on by default in recent GCC and Clang) builds binaries that detect LSE at runtime and use it when present, falling back to LDXR/STXR otherwise.
On N1 and later, LSE improves mutex-heavy benchmarks (Memcached, Redis SET-heavy) by 20-40% versus the LDXR/STXR equivalent.
// LSE atomic vs legacy LDXR/STXR
// Legacy exclusive-load/store loop — can livelock under contention
retry:
    ldxr   w1, [x0]        // load-exclusive the current value
    add    w1, w1, #1
    stxr   w2, w1, [x0]    // store-exclusive; w2 != 0 means it failed
    cbnz   w2, retry

// LSE (Neoverse) — single instruction, executed at the HN-F
    mov    w3, #1
    ldadd  w3, w1, [x0]    // atomic fetch-add; old value returned in w1
// round-trip: 1 REQ → HN-F, 1 RSP back
// no need to own the line in Unique state
The CHI protocol at the HN-F has dedicated atomic RMW opcodes; the home node's cache controller executes them without the requester ever taking the line.
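At the source level nothing changes: standard C11 atomics (or the compiler builtins) lower to LSE when the target has it, or to a runtime-dispatched helper under -moutline-atomics. A small sketch:

#include <stdatomic.h>

_Atomic unsigned long hits;

// Built with -march=armv8.1-a (or newer) this compiles to a single LDADD,
// executed at the HN-F rather than via an LDXR/STXR retry loop; built with
// -moutline-atomics it calls a helper that selects LSE at runtime.
unsigned long count_hit(void)
{
    return atomic_fetch_add_explicit(&hits, 1, memory_order_relaxed);
}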
14
Clock, Voltage, Power Plans
Typical sustained clock: 2.5 – 3.3 GHz. Cortex-X can hit 3.5+ GHz on mobile because those workloads are short and bursty; servers want sustained, thermally stable clocks.
Per-core DVFS domains (N1+). Idle cores drop to very low power.
Graviton 3 reports ~60 W TDP for 64 cores (~1 W/core at full load). Xeon 6430 at 64 cores: ~250 W.
Power-state hints:
WFI / WFE / WFIT / WFET (the WFE wait idiom is sketched after this list)
PSCI CPU_SUSPEND
Per-core clock gating
L3 way power-down on idle
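WFE pairs with the exclusive monitor: arm the monitor with a load-exclusive, then sleep until another core's store to that address (or an explicit SEV) wakes the waiter. A sketch of the idiom in GCC/Clang inline assembly (the function name is illustrative):

#include <stdint.h>

// Spin until *flag becomes non-zero, sleeping in WFE between polls.
// LDAXR arms the exclusive monitor; a store by another core clears it
// and generates a wake-up event, so the waiting core does not burn power.
static inline void wait_for_flag(volatile uint32_t *flag)
{
    uint32_t v;
    for (;;) {
        __asm__ volatile("ldaxr %w0, %1" : "=r"(v) : "Q"(*flag) : "memory");
        if (v != 0)
            break;
        __asm__ volatile("wfe" ::: "memory");
    }
}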
AMU (Activity Monitors, v8.4-A) gives the OS a hardware view of busy cycles and delivered frequency per core; Linux uses it for frequency-invariant load tracking in the scheduler and cpufreq.
Server perf/W story
In 2024: Graviton 4 delivers ~1.3 × SPECrate perf/W of contemporary Intel Emerald Rapids on integer cloud workloads. That's Arm's central pitch to AWS, Microsoft, Google.
Density advantage
192 × N2 cores in a single socket at 250 W = 1.3 W/core. Same socket x86 tops out at ~128 cores at 400 W = ~3.1 W/core. That is why 1U Arm servers are replacing 2U x86 servers for stateless workloads.
15
Performance Counters — How to Actually Measure
Linux perf (perf stat, perf record -e <event>) works on Neoverse just as on x86; core events are selected by the Arm-architected event numbers.
SPE (Statistical Profiling Extension) — perf record -e arm_spe// records sampled instructions with per-sample latency and data-source attribution.
AmperePerf, NVIDIA NsightCS — tooling layered on top of PMU + SPE.
Topdown on Neoverse
v8.8-A adds Intel-style "Topdown" PMU events. Linux perf stat --topdown breaks cycles into: Retiring / Bad Speculation / Frontend Bound / Backend Bound / Memory Bound — same model as Intel VTune.
Arm Performance Libraries
ArmPL, ACfL (Arm Compiler for Linux), NVIDIA HPC SDK all include FFT / BLAS tuned per Neoverse generation. Autotuning dispatches to N1 / V1 / N2 / V2 paths.
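Consuming the tuned libraries is just the standard BLAS interface. A minimal DGEMM sketch, assuming ArmPL's armpl.h header and an -larmpl_lp64-style link line (a generic CBLAS header works the same way; sizes and names are illustrative):

#include <armpl.h>   // Arm Performance Libraries: BLAS/LAPACK/FFT declarations

// C = A * B for n×n row-major matrices. The library dispatches at runtime
// to a kernel tuned for the Neoverse core it finds (N1/V1/N2/V2 paths).
void matmul(int n, const double *A, const double *B, double *C)
{
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,
                1.0, A, n,
                     B, n,
                0.0, C, n);
}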
16
Lessons
"Why is V1 SVE 2×256 but V2 is 4×128?" → same total FLOPs, different dispatch shape. V1 favours long-vector HPC (BLAS); V2 favours short-vector dispatch density (cloud + DB).
"What makes Neoverse different from Cortex-A76?" → same core plus RAS, larger TLBs, MPAM, CHI mesh egress, server-tuned prefetchers, server clock/voltage curves.
"Why SMT on V but not N?" → HPC + DB workloads memory-stall; SMT hides that. Cloud scale-out is already parallel across cores; SMT just causes cache pressure.
"How do LSE atomics scale better than LDXR/STXR?" → RMW executed at the home node (HN-F) in CHI; no cache line transfer to the requester. Scales to 128+ cores.
"What is MPAM used for?" → multi-tenant cache + bandwidth QoS. Linux resctrl-style. Cloud operators stop noisy-neighbour VMs.
"What is poison in RAS?" → when an ECC error is uncorrectable, the cache line carries a 1-bit flag. Process only gets SError if/when it reads that exact line. Contained, not fatal.