RAS everywhere — ECC on L1/L2/L3, poison handling, containable aborts, SDEI for fatal-error signalling.
Larger TLBs + TLB hierarchy — servers touch much larger working sets; N2 has ~2-3× the L2 TLB of Cortex-A710.
Server-tuned prefetchers — favour large strides, DRAM-bandwidth-aware throttling.
CHI-B (N2+) egress — not DSU: each core sits on a CMN mesh tile directly.
MPAM v1 — allocate cache ways + memory bandwidth per partition ID. Mandatory for multi-tenant cloud.
Clock/voltage optimised for 2.5-3.3 GHz sustained — not 3.5+ GHz burst. Better perf/W at scale.
Why not just use Cortex-A?
A phone Cortex-X runs at 3.5 GHz for 200 ms and thermally throttles. A server Neoverse runs at 3.0 GHz continuously for 5 years. That requires very different silicon validation, leakage budgets, and RAS coverage — worth a separate product line.
Area ≠ wasted
Extra TLB + RAS + MPAM makes Neoverse ~15-20% larger than its Cortex-X cousin. At datacentre scale, that area is recovered 10× by better availability.
First silicon: AWS Graviton 3 (Nov 2021) — 64 cores, 7-chiplet design, DDR5-4800, PCIe Gen 5.
SVE 256-bit vs 4×128
Same total FLOPs, but 2 × 256-bit is better for 256-bit-wide loads / stores and HPC kernels that naturally fit 256-bit. 4 × 128-bit gives more dispatch flexibility for cloud DB workloads. V1 chose wide-and-few; V2 chose narrow-and-many.
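Because SVE code is vector-length-agnostic, the 2×256 vs 4×128 split is invisible to correctly written software: the same binary simply advances by whatever the hardware reports per iteration. A minimal sketch using the ACLE intrinsics (the function name is illustrative; assumes a compiler flag such as -O2 -march=armv8-a+sve):

#include <arm_sve.h>
#include <stddef.h>

// y[i] += a * x[i], written once, correct for any vector length:
// svcntd() returns the number of 64-bit lanes the hardware provides,
// so the loop advances 4 doubles per iteration on V1 (256-bit SVE)
// and 2 on V2 (128-bit SVE2) — only throughput differs, never results.
void daxpy_sve(double a, const double *x, double *y, size_t n)
{
    for (size_t i = 0; i < n; i += svcntd()) {
        svbool_t pg = svwhilelt_b64_u64(i, n);   // predicate covers the tail
        svfloat64_t vx = svld1_f64(pg, &x[i]);
        svfloat64_t vy = svld1_f64(pg, &y[i]);
        vy = svmla_n_f64_x(pg, vy, vx, a);       // vy += vx * a
        svst1_f64(pg, &y[i], vy);
    }
}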
Graviton 3's chiplet trick
7 dies on an Amazon-designed organic package: one 64-core compute die plus four DDR5-controller chiplets and two PCIe chiplets. Similar idea to AMD's Epyc — splitting memory and I/O off the compute die keeps the big die smaller and yields better than one huge monolithic die.
05
Neoverse N2 — the Armv9 Workhorse
Based on Cortex-A710 (mobile flagship, v9-A). 5-wide decode, 8-wide issue, 4 × 128-bit SVE2.
Mandatory SVE2 — the first Neoverse core with SVE2 (V1 introduced first-generation SVE); both are vector-length-agnostic.
Armv9-A features: RME / CCA (confidential compute), BTI, MTE.
Sweet spot: 64-128 cores per socket at 2.7-3.2 GHz, DDR5 / LPDDR5X, CXL 2.0.
Canonical silicon: Microsoft Cobalt 100 (128 cores, Azure) and Alibaba Yitian 710 (128 cores, 2021 early adopter); AWS Graviton 4 instead uses the V2 core.
Feature              N1            N2
Arch                 v8.2-A        v9.0-A
SIMD                 NEON 2×128    SVE2 4×128
BTI / MTE            No            Yes
RME / CCA            No            Yes
MPAM                 -             v1 mandatory
Mesh                 CMN-600       CMN-700
Peak SPECint/core    ~20           ~30
06
Neoverse V2 — Grace Inside
Based on Cortex-X3. 6-wide decode. Biggest IPC jump in Neoverse history at the time (~35% SPECint over V1).
SIMD: 4 × 128-bit SVE2 (not 2 × 256 like V1) — better for dispatch-limited code, matches V1 in total FLOPs.
SMT2 retained.
Larger ROB (~384 entries), refined TAGE predictor.
Private L2 up to 2 MB, 8-way.
Canonical silicon: NVIDIA Grace CPU — 72 × V2 cores per chiplet, up to 480 GB LPDDR5X on-package, NVLink-C2C to a Hopper/Blackwell GPU at 900 GB/s coherent. AWS Graviton 4 (96 × V2) is the other flagship V2 part.
Grace + Hopper
The CPU-GPU "superchip" pairs a V2-based Grace CPU with a Hopper or Blackwell GPU on one board. NVLink-C2C lets the GPU access the CPU's LPDDR5X coherently — no PCIe hop — which opens the door to ML models that spill to host memory transparently.
HPC scaling
Grace in pure CPU mode (72+72 = 144 cores, up to 960 GB LPDDR5X) targets storage nodes for AI training clusters and in-memory databases. Shipping in HPE Cray EX systems (Slingshot fabric), Eviden BullSequana, and several European HPC centres.
07
Neoverse N3 / V3 — 2024 Refresh
N3 — successor to N2. ~20% IPC gain, ~25% energy reduction.
Built on a Cortex-A720-class core
Improved BF16/INT8 matmul path — "CPU AI inference" story
4 × 128-bit SVE2 retained
V3 — successor to V2. ~35% SPECint gain, bigger per-core resources.
New "data-dependent prefetcher" — latency-hiding for pointer-chasing workloads (DB B-trees)
CCA hardware acceleration — near-zero overhead for Realms
Delivered as CSS N3 / CSS V3 — pre-integrated with CMN S3 mesh + GIC-700 + SMMU-700.
CPU AI inference
N3's BF16/INT8 matrix instructions (BFMMLA, SMMLA) give ~2-3× the LLM tokens/sec of N2 at similar power. The pitch: avoid a separate NPU for small-model inference and keep everything on the CPU.
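For a flavour of what the instructions do, here is a minimal sketch of the INT8 path using the ACLE intrinsic that maps to SMMLA (function name and tile framing are illustrative; assumes the I8MM extension is enabled, e.g. -march=armv8.2-a+i8mm):

#include <arm_neon.h>

// One SMMLA via the ACLE intrinsic vmmlaq_s32: multiplies a 2x8 tile of
// int8 values by an 8x2 tile and accumulates the 2x2 int32 result into
// acc — 32 multiply-accumulates per instruction, the building block of a
// quantised matmul kernel. BFMMLA has the analogous vbfmmlaq_f32 form for
// bfloat16 inputs with float32 accumulation.
int32x4_t mmla_tile(int32x4_t acc, int8x16_t a, int8x16_t b)
{
    return vmmlaq_s32(acc, a, b);
}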
V3's prefetcher
Picks up pointer-chase patterns (linked list, B-tree traversal). Not just stride-based. Arm reports ~15% Postgres and Memcached improvements from this feature alone.
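The access pattern it targets looks like the sketch below: each load address comes from the data returned by the previous load, so there is no stride for a conventional prefetcher to learn, while a data-dependent prefetcher can run ahead by dereferencing the next pointer itself (structure and function names are illustrative):

#include <stddef.h>

struct node {
    struct node *next;   // the address of the following node is only
                         // known once this field has been loaded
    long         key;
};

// Classic pointer chase (linked list / B-tree descent): every load is
// serialised behind the previous cache miss. A stride prefetcher sees no
// pattern; a data-dependent prefetcher can prefetch p->next early.
long sum_keys(const struct node *p)
{
    long sum = 0;
    for (; p != NULL; p = p->next)
        sum += p->key;
    return sum;
}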
08
SMT2 in Detail
V1 and V2 (and V3) implement SMT2 — two hardware threads per physical core.
Each thread has its own:
Architectural register file
Rename map
Return stack buffer (RSB)
PMU counters
Shared:
Execution units
L1/L2 caches
Branch predictor tables
TLBs
Each thread presents as a separate PE to the OS, with its own MPIDR Aff0.
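MPIDR_EL1 is only readable from EL1, so user space normally sees the topology through the kernel, but the field layout itself is simple. A sketch of decoding the affinity fields (the helper name is illustrative; field positions per the Arm ARM):

#include <stdint.h>

// MPIDR_EL1 affinity fields: Aff3[39:32], Aff2[23:16], Aff1[15:8], Aff0[7:0].
// On an SMT2 Neoverse core the two hardware threads share Aff1/Aff2/Aff3
// and differ only in Aff0 (0 or 1).
struct affinity { unsigned aff3, aff2, aff1, aff0; };

static struct affinity decode_mpidr(uint64_t mpidr)
{
    struct affinity a;
    a.aff3 = (mpidr >> 32) & 0xff;
    a.aff2 = (mpidr >> 16) & 0xff;
    a.aff1 = (mpidr >>  8) & 0xff;
    a.aff0 =  mpidr        & 0xff;   // thread number within the core
    return a;
}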
When SMT helps
Memory-bound, latency-tolerant workloads — DB joins, in-memory KV, nginx request handling. One thread stalls on L3 miss while the other keeps the EUs busy. Typical gains: 15-30% throughput at ~5% per-thread latency cost.
When SMT hurts
Branch-heavy single-thread workloads (compression, ray tracing) — two threads thrash branch-predictor tables. HPC codes that fit comfortably in L1/L2 see cache-thrashing losses. Often configured off for pure HPC.
09
RAS — Reliability, Availability, Serviceability
Servers run 24/7 for years. RAS features turn "crash" into "detected, contained, reported."
Arm RAS architecture (v8.2-A onward) defines a standard framework:
Poison propagation: bad cache line carries a 1-bit "poison" until consumed
Neoverse cores implement ECC on:
L1 D-cache, L2, L3 tags + data
TLB entries
Register file (parity protected)
Memory scrubbing at CMN / memory controller level.
Poison — the real trick
When a cache line takes an uncorrectable error, the CPU doesn't kill the system. It marks the line as poisoned. The OS/firmware only needs to deal with it when (and if) a process actually reads from that exact line.
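On Linux that consumption surfaces in user space as a SIGBUS with si_code BUS_MCEERR_AR (action required); a service that wants to survive can catch it and retire the affected page rather than die. A minimal sketch (recovery policy omitted; a real handler must stick to async-signal-safe calls):

#define _GNU_SOURCE
#include <signal.h>
#include <stdlib.h>
#include <unistd.h>

// SIGBUS handler for hardware memory errors. BUS_MCEERR_AR means the
// faulting instruction consumed poisoned data; si_addr holds the address.
static void mce_handler(int sig, siginfo_t *si, void *uctx)
{
    (void)sig; (void)uctx;
    if (si->si_code == BUS_MCEERR_AR || si->si_code == BUS_MCEERR_AO) {
        // A real service would record si->si_addr, unmap or re-fetch the
        // page, and fail only the affected request. Exiting is the
        // simplest form of containment.
        static const char msg[] = "poisoned memory consumed - exiting\n";
        write(STDERR_FILENO, msg, sizeof msg - 1);
        _exit(EXIT_FAILURE);
    }
    abort();   // any other SIGBUS is a plain bug
}

int main(void)
{
    struct sigaction sa = { .sa_sigaction = mce_handler,
                            .sa_flags     = SA_SIGINFO };
    sigaction(SIGBUS, &sa, NULL);
    /* ... workload ... */
    return 0;
}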
SDEI
Software Delegated Exception Interface — an EL3 firmware service that forwards RAS events to EL2/EL1 asynchronously. Lets Linux log + quarantine a faulty page without taking the whole kernel down.
10
MPAM — Memory Partitioning & Monitoring
An optional extension from Armv8.4-A; mandatory from Neoverse N2 onward.
Each memory access is tagged with a PartID (partition ID) + PMG (monitoring group ID).
At each MPAM-aware point (L2, L3, CMN, memory controller), resources are partitioned by PartID:
Cache way/capacity allocation (way masks, like Intel CAT)
Bandwidth quotas (min / max BW per partition)
Monitoring counters for observed use
OS (Linux resctrl) or hypervisor assigns PartIDs per process / VM.
The enabler for multi-tenant cloud QoS — noisy-neighbour workloads can be capped.
# Linux MPAM (resctrl mount)
# Create a partition for tenant A
mount -t resctrl resctrl /sys/fs/resctrl
mkdir /sys/fs/resctrl/tenantA
# Each resource gets its own schemata line:
# L3 mask 00ff00 → 8 ways allocated; MB 50 → 50% of memory bandwidth
echo "L3:0=00ff00" > /sys/fs/resctrl/tenantA/schemata
echo "MB:0=50"     > /sys/fs/resctrl/tenantA/schemata
# CPUs 4-7 belong to this partition
echo "4-7" > /sys/fs/resctrl/tenantA/cpus
MPAM was designed to match Intel CAT/CDP/MBA — same hyperscaler-friendly partition model, Arm-architectural instead of Intel-proprietary.
11
Neoverse Comparison Table
Param              N1        V1           E1       N2          V2          N3          V3
Year               2019      2021         2020     2021        2022        2024        2024
Arch               v8.2      v8.4         v8.2     v9.0        v9.0        v9.2        v9.2
Decode             4         5            2        5           6           5           10
Issue              8         8            3        8           8           8           10+
SIMD               NEON      2×256 SVE    NEON     4×128 SVE2  4×128 SVE2  4×128 SVE2  4×128 SVE2
SMT                -         2            2        -           2           -           2
Private L2         1 MB      2 MB         128 KB   1 MB        2 MB        1-2 MB      2-3 MB
Max cores/socket   80-128    64           -        128         72-144      192+        96
Mesh               CMN-600   CMN-650/700  CMN-600  CMN-700     CMN-700     CMN S3      CMN S3
"E1 decode=2" reflects its in-order dual-issue design. Not shown: custom cores (Ampere AmpereOne A192 uses 192 in-house cores; Apple Silicon is outside the Neoverse family entirely).
LSE atomics (CAS, LDADD, SWP) in Neoverse are implemented as far atomics at the HN-F: the home node for that cache line executes the RMW on behalf of the requester, without transferring the line.
This scales to 128+ cores with near-zero tail latency — the hot counter / lock simply queues at the HN-F.
The -moutline-atomics compiler flag (on by default in recent GCC and Clang) builds binaries that detect LSE at runtime and use it when present, falling back to LDXR/STXR otherwise.
On N1 and later, LSE improves mutex-heavy benchmarks (Memcached, Redis SET-heavy) by 20-40% versus the LDXR/STXR equivalent.
// LSE atomic vs legacy LDXR/STXR
// Legacy exclusive-load/store loop — can livelock under contention
retry:
    ldxr   w1, [x0]        // load-exclusive the current value
    add    w1, w1, #1
    stxr   w2, w1, [x0]    // store-exclusive; w2 != 0 means it failed
    cbnz   w2, retry

// LSE (Neoverse) — single instruction, executed at the HN-F
    mov    w3, #1
    ldadd  w3, w1, [x0]    // atomic fetch-add; old value returned in w1
// round-trip: 1 REQ → HN-F, 1 RSP back
// no need to own the line in Unique state
The CHI protocol at the HN-F has dedicated atomic RMW opcodes; the home node's cache controller executes them without the requester ever taking the line.
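At the source level nothing changes: standard C11 atomics (or the compiler builtins) lower to LSE when the target has it, or to a runtime-dispatched helper under -moutline-atomics. A small sketch:

#include <stdatomic.h>

_Atomic unsigned long hits;

// Built with -march=armv8.1-a (or newer) this compiles to a single LDADD,
// executed at the HN-F rather than via an LDXR/STXR retry loop; built with
// -moutline-atomics it calls a helper that selects LSE at runtime.
unsigned long count_hit(void)
{
    return atomic_fetch_add_explicit(&hits, 1, memory_order_relaxed);
}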
14
Clock, Voltage, Power Plans
Typical sustained clock: 2.5 – 3.3 GHz. Cortex-X can hit 3.5+ GHz on mobile because those workloads are short and bursty; servers want sustained, thermally stable clocks.
Per-core DVFS domains (N1+). Idle cores drop to very low power.
Graviton 3 reports ~60 W TDP for 64 cores (~1 W/core at full load). Xeon 6430 at 64 cores: ~250 W.
Power-state hints:
WFI / WFE / WFIT / WFET (the WFE wait idiom is sketched after this list)
PSCI CPU_SUSPEND
Per-core clock gating
L3 way power-down on idle
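WFE pairs with the exclusive monitor: arm the monitor with a load-exclusive, then sleep until another core's store to that address (or an explicit SEV) wakes the waiter. A sketch of the idiom in GCC/Clang inline assembly (the function name is illustrative):

#include <stdint.h>

// Spin until *flag becomes non-zero, sleeping in WFE between polls.
// LDAXR arms the exclusive monitor; a store by another core clears it
// and generates a wake-up event, so the waiting core does not burn power.
static inline void wait_for_flag(volatile uint32_t *flag)
{
    uint32_t v;
    for (;;) {
        __asm__ volatile("ldaxr %w0, %1" : "=r"(v) : "Q"(*flag) : "memory");
        if (v != 0)
            break;
        __asm__ volatile("wfe" ::: "memory");
    }
}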
AMU (Activity Monitors, v8.4-A) gives the OS a hardware view of busy cycles and delivered frequency per core; Linux uses it for frequency-invariant load tracking in the scheduler and cpufreq.
Server perf/W story
In 2024: Graviton 4 delivers ~1.3 × SPECrate perf/W of contemporary Intel Emerald Rapids on integer cloud workloads. That's Arm's central pitch to AWS, Microsoft, Google.
Density advantage
192 × N2 cores in a single socket at 250 W = 1.3 W/core. Same socket x86 tops out at ~128 cores at 400 W = ~3.1 W/core. That is why 1U Arm servers are replacing 2U x86 servers for stateless workloads.
15
Performance Counters — How to Actually Measure
Linux perf (perf stat, perf record -e <event>) works on Neoverse just as on x86; core events are selected by the Arm-architected event numbers.
SPE (Statistical Profiling Extension) — perf record -e arm_spe// records sampled instructions with per-sample latency and data-source attribution.
AmperePerf, NVIDIA NsightCS — tooling layered on top of PMU + SPE.
Topdown on Neoverse
v8.8-A adds Intel-style "Topdown" PMU events. Linux perf stat --topdown breaks cycles into: Retiring / Bad Speculation / Frontend Bound / Backend Bound / Memory Bound — same model as Intel VTune.
Arm Performance Libraries
ArmPL, ACfL (Arm Compiler for Linux), NVIDIA HPC SDK all include FFT / BLAS tuned per Neoverse generation. Autotuning dispatches to N1 / V1 / N2 / V2 paths.
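Consuming the tuned libraries is just the standard BLAS interface. A minimal DGEMM sketch, assuming ArmPL's armpl.h header and an -larmpl_lp64-style link line (a generic CBLAS header works the same way; sizes and names are illustrative):

#include <armpl.h>   // Arm Performance Libraries: BLAS/LAPACK/FFT declarations

// C = A * B for n×n row-major matrices. The library dispatches at runtime
// to a kernel tuned for the Neoverse core it finds (N1/V1/N2/V2 paths).
void matmul(int n, const double *A, const double *B, double *C)
{
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,
                1.0, A, n,
                     B, n,
                0.0, C, n);
}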
16
Lessons
"Why is V1 SVE 2×256 but V2 is 4×128?" → same total FLOPs, different dispatch shape. V1 favours long-vector HPC (BLAS); V2 favours short-vector dispatch density (cloud + DB).
"What makes Neoverse different from Cortex-A76?" → same core plus RAS, larger TLBs, MPAM, CHI mesh egress, server-tuned prefetchers, server clock/voltage curves.
"Why SMT on V but not N?" → HPC + DB workloads memory-stall; SMT hides that. Cloud scale-out is already parallel across cores; SMT just causes cache pressure.
"How do LSE atomics scale better than LDXR/STXR?" → RMW executed at the home node (HN-F) in CHI; no cache line transfer to the requester. Scales to 128+ cores.
"What is MPAM used for?" → multi-tenant cache + bandwidth QoS. Linux resctrl-style. Cloud operators stop noisy-neighbour VMs.
"What is poison in RAS?" → when an ECC error is uncorrectable, the cache line carries a 1-bit flag. Process only gets SError if/when it reads that exact line. Contained, not fatal.