ARM NEOVERSE · PRESENTATION 01

History & Product Lines

From "Cortex-A for servers" to a distinct infrastructure brand
Cosmos · Ares · Zeus · Poseidon · N1 · V1 · E1 · N2 · V2 · N3 · V3 · CSS
02

Prehistory — Arm in the Server Room (2012-2018)

  • Calxeda EnergyCore (2011-13) — first serious Arm server SoC (Cortex-A9, fabric-attached). Too early; Calxeda shut down in 2013.
  • AMD Opteron A1100 "Seattle" (2016) — 8 × Cortex-A57. AMD's first and last attempt at Arm server; cancelled after Zen prioritisation.
  • Qualcomm Centriq 2400 (2017) — 48 × custom Falkor cores. Technically excellent, strategically abandoned in 2018.
  • Marvell ThunderX / ThunderX2 (Cavium) — 32-48 × custom cores, shipped in Cray XC50 + Astra supercomputer.
  • Ampere eMAG (2018) — 32 × custom X-Gene, the first "real" commodity Arm server silicon.
  • Every one of these built their own core. No "Neoverse" yet: Arm provided licences but no server-optimised IP.
"The question in 2017 wasn't whether Arm could do a server. It was whether anyone other than Arm's custom licensees could build one." — infrastructure-industry view, paraphrased

Arm's answer: launch Neoverse as a dedicated infrastructure roadmap in October 2018, re-characterising Cortex-A76 for servers and committing to an annual cadence.

03

The 2018 Neoverse Announcement

  • October 16, 2018 — Arm TechCon, San Jose.
  • Arm unveiled three roadmaps under one brand:
    • Cosmos — available now (2018). = Cortex-A72 / A75 repackaged for server.
    • Ares — 2019. 7 nm, first "Neoverse-native" — became N1.
    • Zeus — 2020. First to carry a non-mobile, server-only design. Became V1.
    • Poseidon — 2021 roadmap slot, v9-A. The codename ultimately landed on V3; N2 (Perseus) and V2 (Demeter) filled the intervening slots.
  • Promise: ~30% performance uplift per generation — met or exceeded through the first ~5 years (N1→V1→V2 reportedly ≈ 50% per gen on SPECrate).
  • Three target profiles announced in 2020:
    • N — "balanced" scale-out server cores
    • V — "performance" wide cores for HPC
    • E — "efficiency" edge/networking cores

Why three profiles?

Mirrors how Arm served mobile with Cortex-A/R/M. A cloud platform wants dozens of N-cores per socket for scale-out; an HPC system wants fewer V-cores with wider SIMD; a 5G baseband or DPU wants E-cores at lowest power.

Annual cadence

Delivered every year since 2019 — N1/E1 (2019), V1 (2020-21), N2 (2021), V2 (2022), N3/V3 (2024). A faster cadence than x86, driven by shared microarchitecture with the Cortex-A flagships.
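The roadmap promise compounds quickly. A quick sanity check — illustrative arithmetic only, not measured data:

```python
# Compounding the roadmap's per-generation promise.
# Illustrative arithmetic only -- measured SPECrate gains varied by generation.

def compounded_gain(per_gen_gain: float, generations: int) -> float:
    """Cumulative speedup after `generations` steps of `per_gen_gain` each."""
    return (1.0 + per_gen_gain) ** generations

print(f"{compounded_gain(0.30, 5):.2f}x after five +30% generations")
```

Five generations at the promised rate would multiply per-core performance almost fourfold, which is why an annual cadence matters more than any single generation's gain.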

04

Neoverse N1 (Ares, 2019)

  • Under the hood: a Cortex-A76 core, re-validated for server use.
  • 4-wide decode, 8-wide issue, 128-bit NEON. No SVE.
  • Private L2 of up to 1 MB, LLC (via CMN-600 mesh) 1-2 MB/core.
  • Server-specific:
    • RAS: ECC on L1/L2/L3, poison handling
    • Single-bit error correction without traps
    • Larger TLBs, server-tuned prefetchers
  • First major deployment: AWS Graviton 2 (announced December 2019) — 64 cores, 7 nm TSMC. AWS reports it became the basis of roughly half of new EC2 capacity within three years.
  • Also shipped in: Ampere Altra (80 → 128 cores), Oracle OCI A1 Flex.
Neoverse N1 (2019)
Architecture ...... Armv8.2-A
Derived from ...... Cortex-A76
Decode / Issue .... 4 / 8 wide
SIMD .............. NEON 128-bit
L1-I / L1-D ....... 64 KB / 64 KB
L2 (private) ...... up to 1 MB
Process ........... 7 nm TSMC
First silicon ..... Graviton 2 (12/2019)
05

Neoverse V1 (Zeus, 2020)

  • Derived from Cortex-X1 — Arm's first big-and-wide core. Added SVE for HPC.
  • 2 × 256-bit SVE FP pipelines — the widest SVE datapath Arm itself has shipped (Fujitsu's custom A64FX implements 512-bit SVE, but is not an Arm design).
  • 5-wide decode, larger ROB, deeper OoO, more aggressive prefetch.
  • No SMT — V1, like every N- and V-series Neoverse, runs one thread per core; within Neoverse only E1 implements SMT2.
  • Private L2 up to 2 MB.
  • First silicon: AWS Graviton 3 (2022) — 64 cores on a monolithic compute die, packaged with four DDR5 and two PCIe chiplets (seven dies total).
  • Also: SiPearl Rhea1 — V1-based European HPC chip for the Jupiter exascale machine (2025).

V1 is the HPC play

HPC and analytics codes vectorise well; Neoverse V1's 2 × 256-bit SVE delivers roughly 2 × the per-core FP throughput of N1 on BLAS-like kernels, with comparable area to a desktop X1.
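The FP gap falls out of simple lane arithmetic. A hedged sketch, assuming N1 has two 128-bit FMA-capable NEON pipes (a plausible reading of the Cortex-A76 pipeline it derives from) against V1's two 256-bit SVE pipes from this slide:

```python
# Peak FP64 FLOPs/cycle from SIMD lane arithmetic.
# Assumptions, not vendor-published peak figures: N1 = two 128-bit
# FMA-capable NEON pipes; V1 = two 256-bit SVE pipes.

def peak_fp64_flops_per_cycle(pipes: int, vector_bits: int) -> int:
    """Peak FP64 FLOPs per cycle: pipes x 64-bit lanes x 2 (FMA = mul+add)."""
    lanes = vector_bits // 64
    return pipes * lanes * 2

n1 = peak_fp64_flops_per_cycle(pipes=2, vector_bits=128)
v1 = peak_fp64_flops_per_cycle(pipes=2, vector_bits=256)
print(f"N1 {n1} vs V1 {v1} FP64 FLOPs/cycle -> {v1 / n1:.0f}x")
```

At equal clocks this gives the ~2× per-core FP ratio; real BLAS kernels land below peak once memory bandwidth enters the picture.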

SMT on Arm servers

x86 has shipped 2-way SMT since the Pentium 4, but Neoverse goes the other way: N- and V-series cores are single-threaded, betting that many physical cores beat SMT threads for cloud isolation and predictability. Only the little E1 uses SMT2, to hide packet-processing stalls.

06

Neoverse E1 (2019) — the Edge Sibling

  • E = Efficiency. Closely related to Cortex-A65AE — a small dual-issue core with limited out-of-order execution.
  • Designed for 5G base stations, NICs, smart switches, SD-WAN appliances — lots of packet-processing threads at low power.
  • Supports SMT2 natively — rare in a core this small, but it hides network-packet latency.
  • Per-core: ~10× smaller than an N1, ~1/3 the power.
  • Shipping in: Marvell Octeon 10 DPU, ASR Microelectronics baseband, various OEM data-processing units (DPUs).
  • Successor E2 announced in 2022, but with fewer public shipments — edge silicon goes quietly.

Why the E tier exists

A DPU terminating millions of packets per second doesn't want huge OoO machinery; it wants many cheap SMT threads that can keep a 400 GbE pipe full. N/V-class cores would waste area and power on branch predictors the workload can't use.
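The sizing intuition above can be made concrete with Little's-law-style arithmetic. Every input here is an illustrative assumption (64 B packets, ~100 ns of memory stall per packet), not a vendor figure:

```python
# Back-of-envelope thread sizing for a 400 GbE DPU.
# All inputs are illustrative assumptions, not vendor figures.

LINK_GBPS = 400          # line rate
PKT_BYTES = 64           # worst-case small packets (framing overhead ignored)
STALL_NS_PER_PKT = 100   # assumed DRAM-bound handling time per packet

pps = LINK_GBPS * 1e9 / 8 / PKT_BYTES       # packets per second at line rate
pps_per_thread = 1e9 / STALL_NS_PER_PKT     # ceiling of one fully-stalled thread
threads_needed = pps / pps_per_thread       # hardware threads to hold line rate

print(f"{pps / 1e6:.0f} Mpps -> ~{threads_needed:.0f} stalled threads to keep up")
```

Under these assumptions a 400 GbE pipe needs on the order of 80 concurrent hardware threads — exactly the regime where many small SMT2 cores beat a few wide ones.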

Adjacent Arm IP

Ethos-N NPUs and Mali GPUs often sit on the same chip — Arm's "infrastructure CSS" reference designs bundle Neoverse-E with Ethos for combined edge-ML + packet boxes.

07

Neoverse N2 (Perseus, 2021 — Armv9-A)

  • Derived from Cortex-A710. First Armv9-A Neoverse. Mandates SVE2.
  • 4-wide decode, 2 × 128-bit SVE2 pipelines per core.
  • Significant server-specific refinements:
    • Larger BTB + refined TAGE predictor
    • Private L2 up to 1 MB
    • Support for MPAM v1 for QoS of last-level cache + memory bandwidth
    • CMN-700 mesh interconnect generation
  • Confidential Compute (RME / CCA) — first Neoverse with hardware Realms.
  • Production silicon:
    • Microsoft Cobalt 100 — 128 cores, Azure (2024)
    • Alibaba Yitian 710 — 128 cores (shipped 2021, the earliest N2 silicon)
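SVE2's defining trait is vector-length agnosticism: one binary runs correctly at whatever vector width the hardware implements. A plain-Python conceptual model of the whilelt-predicated loop structure (not real SVE intrinsics):

```python
# Conceptual model of an SVE-style vector-length-agnostic (VLA) loop:
# the hardware picks the lane count; a whilelt-style predicate masks
# the tail. Plain Python illustration, not real SVE intrinsics.

def vla_add(a: list, b: list, vl_lanes: int) -> list:
    """Elementwise add, processed `vl_lanes` lanes at a time."""
    out = [0.0] * len(a)
    i = 0
    while i < len(a):
        active = min(vl_lanes, len(a) - i)   # predicate: in-bounds lanes only
        for lane in range(active):           # one "vector op" over active lanes
            out[i + lane] = a[i + lane] + b[i + lane]
        i += vl_lanes
    return out

# Identical results whether the "hardware" is 128-bit (4 FP32 lanes)
# or 256-bit (8 lanes) -- the code never hard-codes the width.
print(vla_add([1, 2, 3, 4, 5], [10, 20, 30, 40, 50], 4))
```

This is why Armv9 can mandate SVE2 across N2-class (128-bit) and wider implementations without fragmenting binaries.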

N = scale out

N2's sweet spot is 64-128 cores per socket at ~2.5-3.5 GHz with DDR5 + CXL 2.0 memory. Arm and partners claim 25-45% better perf/watt than x86 Zen 4 on cloud-native stacks (nginx, Redis, Java).

V2 (Demeter, 2022)

Paired with N2: derived from Cortex-X3. 4 × 128-bit SVE2 — narrower pipes than V1's 2 × 256-bit but the same 512-bit total, so peak FLOPs match. Shipped in the NVIDIA Grace CPU (72 cores per die; the Grace Superchip pairs two for 144).

08

Neoverse N3 & V3 (2024) — the CSS Generation

  • Announced February 2024 at Arm's Neoverse Tech Day.
  • N3 — successor to N2. Key changes:
    • Improved core IPC (~20% over N2 on SPEC)
    • Better bf16 / INT8 matmul throughput for AI inference on CPU
    • Tightened DVFS curves for lower idle power
  • V3 — successor to V2. Reported as the biggest generation-on-generation jump of the line (~35% SPECint gain over V2):
    • Wider front-end, larger reorder buffer (≈384 entries reported)
    • 4 × 128-bit SVE2 with higher utilisation
    • New data-dependent prefetcher
    • CCA acceleration
  • Delivered as part of Arm Compute Subsystems (CSS) — pre-validated core + CMN + GIC + SMMU drops. Used by Microsoft Cobalt 200 and AWS Graviton 5 (reportedly).
CSS = Compute Sub-System
Neoverse cores + CMN mesh (the server-side DSU equivalent) + SLC slices
GIC-700 + SMMU-700 + MPAM
CoreSight + CHI egress

CSS is the "server-on-a-chip starter kit" — customers integrate chiplets on top, saving 1-2 years of IP integration.

09

Comparing the Families

Generation | Codename | Based on | SVE | Year | Canonical silicon
N1 | Ares | Cortex-A76 | — (NEON only) | 2019 | Graviton 2, Ampere Altra
E1 | Helios | Cortex-A65AE | — | 2019 | Marvell Octeon 10, 5G basebands
V1 | Zeus | Cortex-X1 | 2 × 256-bit | 2021 | Graviton 3 / 3E, SiPearl Rhea1
N2 | Perseus | Cortex-A710 (v9-A) | 2 × 128-bit SVE2 | 2021 | Cobalt 100, Yitian 710
V2 | Demeter | Cortex-X3 (v9-A) | 4 × 128-bit SVE2 | 2022 | NVIDIA Grace (72/144), Graviton 4
N3 | Hermes | A720-class (v9.2-A) | 4 × 128-bit SVE2 | 2024 | Azure Cobalt 200 (reported)
V3 | Poseidon | X4-class (v9.2-A) | 4 × 128-bit SVE2 | 2024 | Graviton 5 (reported), next-gen Grace

Not shown: custom-core Neoverse "cousins" like Ampere AmpereOne (192 custom cores, Armv8.6-A). The V-series trades core count for width; the N-series goes for socket density.

10

The N vs V Tradeoff

[Chart: throughput vs single-thread — cores per socket against single-thread IPC, with E1 at the many-threads extreme, N1/N2/N3 in the middle band, V1/V2/V3 at the high-IPC end]
  • E-series — low IPC, high thread count, very low power. DPUs, 5G, edge.
  • N-series — middling IPC, very high core counts per socket (up to 192). Cloud scale-out.
  • V-series — wide OoO + wide SIMD for single-thread + HPC; fewer cores per socket (typically 64-96).
  • Across customers: AWS started on N (Graviton 2) and moved to V (Graviton 3/4); Microsoft uses N for Azure Cobalt; NVIDIA uses V for Grace (HPC + AI).
11

Timeline — 2012 to 2024

2012
Calxeda EnergyCore (Cortex-A9): first serious Arm server — too early
2016
AMD Opteron A1100 (A57) ships then dies; Cavium ThunderX finds scientific use
2017
Qualcomm Centriq 2400 (Falkor custom); cancelled 2018
2018
Neoverse brand launched — Cosmos / Ares / Zeus / Poseidon roadmap
2019
Neoverse N1 (Ares) and E1 (Helios) launch; AWS Graviton 2 — the turning point
2020
Ampere Altra (80-core N1) ships
2021
Neoverse V1 (Zeus, 2 × 256-bit SVE) launches; SiPearl Rhea1 design win; Alibaba Yitian 710 ships N2 early
2022
Graviton 3 (V1 + DDR5 + PCIe Gen 5); Neoverse N2 (Perseus) generally available
2022
Neoverse V2 (Demeter) + NVIDIA Grace unveiled
2023
AmpereOne (custom-core "Siryn") ships; Arm IPO on NASDAQ
2024
Neoverse N3 / V3; Azure Cobalt 100 (N2) + AWS Graviton 4; Arm Total Design + CSS IP delivery
12

Why Neoverse Succeeded Where Earlier Attempts Failed

  • Shared microarchitecture with Cortex-A flagships — amortises R&D. Each Cortex-X/-A generation becomes a Neoverse within ~1 year.
  • Perf/Watt at hyperscale — AWS, Microsoft, Google all face power-constrained datacentres. Neoverse N1/N2 at 2.5-3 W/core vs Xeon's 5-8 W/core is decisive.
  • Software ecosystem caught up — Linux, JVM, Go, .NET, PyTorch, Postgres, Redis, Kafka all have first-class AArch64 builds since 2020.
  • SystemReady — a Linux distro boots on any compliant server without vendor-specific patches. Finally matched x86's UEFI/ACPI experience.
  • Hyperscaler self-sufficiency — AWS, Microsoft, Google now design their own Arm silicon, cutting out the middleman. Neoverse IP licences + Arm's CSS makes this practical.
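Because the toolchains above now treat AArch64 as a first-class target, portable code usually needs only a runtime check when it must special-case the architecture. A minimal probe, standard library only:

```python
import platform

def is_aarch64() -> bool:
    """True when the interpreter runs on an AArch64 (arm64) machine,
    e.g. a Graviton or Cobalt cloud instance."""
    return platform.machine().lower() in ("aarch64", "arm64")

print("running on AArch64:", is_aarch64())
```

The same check is what build scripts and CI matrices typically use to pick AArch64 wheels or container images.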

Graviton as Arm's proof-point

AWS reports Graviton now accounts for >50% of new EC2 capacity. Customers see 20-40% price/perf improvement over Intel/AMD. Hard to argue with the datacentre P&L.

Open-source alignment

Arm backed Linaro, TF-A, and TianoCore EDK2 to make sure the full server firmware stack was permissively licensed. Removed a big barrier for hyperscalers.

13

Neoverse CSS — the 2024 Shift

  • Compute Sub-System (CSS) — a pre-integrated block of cores + DSU-like cluster + CMN slice + system IP (GIC, SMMU, CoreSight).
  • Customers get a validated RTL drop ready to tile into a chiplet or SoC. Saves 12-24 months.
  • First CSS: CSS N2 (64 cores in a mesh), delivered late 2023. Azure Cobalt 100 built on it.
  • Follow-on: CSS V3 / N3 (2024). Some variants offer optional bundled HBM3 / LPDDR5X memory controllers.
  • Part of "Arm Total Design" — 30+ partner ecosystem (TSMC, Samsung Foundry, Cadence, Synopsys, Siemens EDA) aligned to help customers build chiplets.

Why CSS exists

Integrating a full Neoverse mesh is hard — physical design, RTL signoff, RAS validation. Only hyperscalers with silicon teams of 100+ engineers could afford it. CSS democratises access to Neoverse for companies with much smaller teams.

The chiplet angle

CSS maps cleanly onto a single chiplet, making Neoverse the "compute" side of UCIe-based multi-chiplet SoCs. Memory / IO / AI accelerators live on complementary chiplets.

14

Relationship to Cortex-A

  • Every shipping Neoverse is based on a Cortex-A/X flagship. But they are not identical — server re-validation adds:
    • RAS features (ECC, poison, lockstep support)
    • Larger TLBs
    • Mesh-style CHI egress (AMBA 5)
    • MPAM v1 QoS
    • Server-tuned prefetchers & branch predictors
    • Tuning for 2-3 GHz sustained vs 3.5+ GHz burst
  • The packaging is very different: Cortex-A ships as an IP drop to phone OEMs with DSU; Neoverse ships to silicon OEMs with CMN integration.

Fork timing

Typical pattern: Cortex-X/-A unveiled at Computex/May. Matching Neoverse unveiled at Arm Neoverse Tech Day (Oct-Feb) 6-12 months later. That's the validation + RAS window.

Benchmark flavours

Phone cores aim for Geekbench ST. Neoverse aims for SPECrate, STREAM, DGEMM, DB/nginx/Kafka rps. Different prefetcher tuning per target.

15

Lessons

  • "What's the difference between Neoverse N and V?" → N is scale-out (many cores, modest IPC). V is performance (wide OoO, wide SVE). E is edge (low power, small SMT2 cores).
  • "What was N1 derived from?" → Cortex-A76, re-validated with server-grade RAS, larger TLBs, CMN-600 egress.
  • "Why did Graviton win in AWS?" → ~40% better perf/$ and perf/W than contemporary x86 Xeon/Epyc on cloud-native workloads. Plus AWS owns the design, cutting Intel/AMD margin.
  • "What is CSS?" → Compute Sub-System: pre-integrated Neoverse cores + CMN + GIC + SMMU + CoreSight. Cuts chiplet/SoC integration from ~2 years to a matter of months.
  • "What SVE width does V1 ship?" → 2 × 256-bit pipelines, the widest SVE of any Neoverse. V2 and V3 moved to 4 × 128-bit for a different throughput shape.
  • "Which Neoverse cores use SMT?" → Only E1 (SMT2). Packet workloads stall on memory constantly, and SMT2 hides the stall cheaply in a tiny core. N- and V-series stay single-threaded: cloud scale-out is already parallel across cores, and SMT would fight cache pressure.
16

References

Arm Ltd. — Neoverse TRMs (N1, V1, N2, V2, N3, V3) — freely downloadable on developer.arm.com
Arm Ltd. — Neoverse Tech Day 2022 / 2024 keynotes and whitepapers
Arm Ltd. — Arm Compute Sub-Systems (CSS) product briefs
AWS — Graviton 2 / 3 / 4 performance whitepapers — aws.amazon.com/ec2/graviton
NVIDIA — NVIDIA Grace CPU Superchip architecture whitepaper
ServeTheHome / Phoronix / Anandtech — independent Neoverse benchmark reviews (2019-2024)
Chipsandcheese.com — microarchitecture deep-dives on Neoverse N1/V1/V2
SiPearl / Jupiter EuroHPC — Rhea1 architecture papers

Presentation built with Reveal.js 4.6 · Playfair Display + DM Sans + JetBrains Mono
Educational use.