ARM NEOVERSE · PRESENTATION 03

CMN Mesh Interconnect

CMN-600 · 650 · 700 · S3 · the fabric that makes 192-core sockets possible
RN-F · HN-F · SN-F · RN-I · MN · SLC slicing · Snoop filter · CHI-B/C/E · CCIX · CXL · UCIe
02

Why a Mesh, Not a Bus

  • For up to ~16 cores, a single shared coherent bus (AMBA 4 ACE) works.
  • Beyond that, the bus becomes the bottleneck: every snoop broadcasts to every core.
  • Arm's answer (2017): a 2-D mesh with CHI (Coherent Hub Interface) transport. Each tile routes packets independently (see the routing sketch below); no shared bus.
  • Snoops become directory-filtered — only cores known to hold a line are snooped.
  • The mesh is what makes 128-192 cores per socket feasible; AMBA 4 ACE topped out around 16 coherent cores.
[Figure: CMN mesh, 4×4 tiles — R = RN-F (core), H = HN-F (home/SLC), S = SN-F (DRAM), I = RN-I (I/O)]
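A minimal sketch of how packets cross such a mesh, assuming simple dimension-ordered (X-then-Y) routing. The coordinates and names are illustrative, not Arm's RTL: the point is that latency grows with Manhattan distance instead of every agent contending for one shared medium.

```c
/* Dimension-ordered (X-then-Y) routing on a 2-D mesh: a packet first
 * travels along X to the target column, then along Y to the target row. */
#include <stdio.h>
#include <stdlib.h>

typedef struct { int x, y; } tile;

/* Hop count from src to dst under X-then-Y routing. */
static int mesh_hops(tile src, tile dst)
{
    return abs(dst.x - src.x) + abs(dst.y - src.y);
}

int main(void)
{
    tile rnf = { 0, 0 };   /* requesting core tile      */
    tile hnf = { 3, 2 };   /* home node for the address */
    printf("RN-F(0,0) -> HN-F(3,2): %d hops\n", mesh_hops(rnf, hnf)); /* 5 */
    return 0;
}
```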
03

Node Types in a CMN Mesh

Node · Type · Role
RN-F · Fully-coherent Request Node · CPU cluster (Neoverse cores). Has a cache; issues coherent REQs.
RN-D · Request Node with DVM · IO-coherent requester (like RN-I) that additionally accepts DVM messages (TLB-invalidate broadcasts).
RN-I · I/O Request Node · I/O bridge — PCIe, SMMU, network, accelerator; non-caching requester.
HN-F · Fully-coherent Home Node · Owns a slice of the SLC; snoop filter resides here; directory for the lines it hosts.
HN-I · I/O Home Node · Home for memory-mapped I/O addresses.
SN-F · Fully-coherent Slave Node · Memory controller — backs HN-F misses to DRAM.
MN · Miscellaneous Node · Barriers, events, resets, debug; system coordination services.
CML · CXS / Chiplet Link · Bridges the mesh to another mesh (chiplet-to-chiplet over UCIe or CCIX).

The HN-F is the star of the show — it holds the snoop filter (directory) for its address range + a slice of the shared last-level cache (SLC). Every read to a coherent address goes to the HN-F for that address.

04

SLC Slicing — Distributed Last-Level Cache

  • The SLC (System Level Cache) is distributed across HN-F nodes. Each HN-F holds a slice.
  • Address hash picks the HN-F: usually hash(PA[47:6]) → HN-F index. Sometimes the hash is interleaved by cache line to spread bandwidth; sometimes by page to improve locality.
  • Typical SLC slice: 2-8 MB per HN-F. A 32-HN-F mesh with 8 MB slices = 256 MB SLC.
  • NVIDIA Grace (Neoverse V2): 117 MB SLC across the CMN-700 mesh.
  • Graviton 3: ~64 MB. Graviton 4: 192 MB.
  • Slice selection is normally fixed in hardware; MPAM partitioning adjusts how much of each slice a tenant can use.

Hash-based interleaving

Hashing by cache-line index (XOR of high + low address bits) gives uniform load across HN-Fs. The cost: a sequential memory stream touches many HN-Fs on every line — more mesh traversal. Hash-by-page trades load balance for locality.
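To make the trade-off concrete, here is a toy version of both policies. The XOR-fold hash and the HN-F count are assumptions for illustration; real CMN hash polynomials are configuration-specific and not public.

```c
/* Sketch of the two slice-selection policies, assuming a
 * power-of-two HN-F count. */
#include <stdint.h>
#include <stdio.h>

#define NUM_HNF 32u            /* must be a power of two here */

/* Line-interleaved: hash PA[47:6], so consecutive 64 B lines spread
 * across HN-Fs (uniform load, more mesh traversal per stream). */
static unsigned hnf_by_line(uint64_t pa)
{
    uint64_t h = pa >> 6;          /* drop the 64 B line offset   */
    h ^= h >> 12; h ^= h >> 24;    /* XOR-fold high bits into low */
    return (unsigned)(h & (NUM_HNF - 1));
}

/* Page-interleaved: hash PA[47:12], so a whole 4 KB page lives in
 * one slice (better locality, lumpier load). */
static unsigned hnf_by_page(uint64_t pa)
{
    uint64_t h = pa >> 12;
    h ^= h >> 12; h ^= h >> 24;
    return (unsigned)(h & (NUM_HNF - 1));
}

int main(void)
{
    for (uint64_t pa = 0; pa < 4 * 64; pa += 64)
        printf("PA=%#06llx  line-hash=HN-F %2u  page-hash=HN-F %2u\n",
               (unsigned long long)pa, hnf_by_line(pa), hnf_by_page(pa));
    return 0;
}
```

Running it shows consecutive 64 B lines fanning out across HN-Fs under the line hash, while the page hash keeps the whole 4 KB page on one slice.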

The snoop filter lives here too

Each HN-F keeps a directory of which RN-Fs cache the lines it owns. Snoops are targeted, not broadcast — this is what makes 128+ core scaling possible.
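A conceptual model of one directory entry: a tag plus a presence bit-vector, one bit per RN-F. The layout is hypothetical, not Arm's format; it only illustrates why snoop fan-out tracks actual sharers rather than core count.

```c
/* Toy HN-F snoop-filter entry: on a conflicting request, the HN-F
 * snoops only the RN-Fs whose presence bit is set. */
#include <stdint.h>
#include <stdio.h>

#define MAX_RNF 128

typedef struct {
    uint64_t tag;                      /* line address >> 6             */
    uint64_t sharers[MAX_RNF / 64];    /* which RN-Fs may hold the line */
} sf_entry;

/* Send targeted snoops instead of a broadcast; returns count sent. */
static int snoop_targets(const sf_entry *e)
{
    int sent = 0;
    for (int rn = 0; rn < MAX_RNF; rn++)
        if (e->sharers[rn / 64] >> (rn % 64) & 1) {
            printf("  SNP -> RN-F %d\n", rn);
            sent++;
        }
    return sent;
}

int main(void)
{
    sf_entry e = { .tag = 0x1234, .sharers = { 1ull << 3 | 1ull << 17, 0 } };
    printf("snooped %d of %d RN-Fs\n", snoop_targets(&e), MAX_RNF);
    return 0;
}
```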

05

The Four CHI Message Channels

Channel · Direction · Purpose
REQ (Request) · RN → HN, HN → SN · Read / Write / Atomic requests
RSP (Response) · Any → Any · Completions, acks, retries, snoop responses without data
SNP (Snoop) · HN → RN-F · Coherence snoops — SnpShared / SnpUnique / SnpCleanInvalid
DAT (Data) · Any → Any · Cache-line data payload (64 B)

Each channel is independent — backpressure on one doesn't block the others. Credit-based flow control: requester gets a "slot" before issuing.
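A toy model of that credit scheme, with assumed buffer depths, showing why exhausting one channel's credits stalls only that channel:

```c
/* Per-channel link credits: a sender must hold a credit before putting
 * a flit on the channel; the receiver returns the credit when it
 * drains its buffer. Each channel keeps its own pool, so a stalled
 * DAT buffer cannot starve SNP. */
#include <stdbool.h>
#include <stdio.h>

typedef struct { int credits; } channel;

static bool try_send(channel *ch, const char *what)
{
    if (ch->credits == 0) {
        printf("%s: no credit, sender stalls (this channel only)\n", what);
        return false;
    }
    ch->credits--;            /* a credit is consumed with the flit */
    return true;
}

static void credit_return(channel *ch) { ch->credits++; }  /* receiver drained */

int main(void)
{
    channel dat = { .credits = 1 }, snp = { .credits = 4 };
    try_send(&dat, "DAT");    /* ok, uses the last DAT credit     */
    try_send(&dat, "DAT");    /* blocks, but SNP is unaffected    */
    try_send(&snp, "SNP");    /* still flows                      */
    credit_return(&dat);      /* receiver frees a slot            */
    try_send(&dat, "DAT");    /* ok again                         */
    return 0;
}
```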

A canonical read

1. RN-F issues ReadShared on REQ to the HN-F that owns the address.
2. HN-F consults its snoop filter: if another RN-F holds the line Unique, it sends SnpShared on SNP to that RN-F.
3. The snooped RN-F returns the line on DAT and acknowledges on RSP.
4. HN-F forwards the data on DAT to the original requester, plus a completion on RSP.
5. The requester's cache line transitions I → SC.

06

DMT & DCT — Direct Transfer Optimizations

  • DMT — Direct Memory Transfer. When HN-F sees that its SLC doesn't have the line and it's going to DRAM, it tells the SN-F (memory controller) to send the data directly to the requester, bypassing the HN-F.
  • DCT — Direct Cache Transfer. When another RN-F holds the line, HN-F arranges for that RN-F to send data directly to the requester, not through HN-F.
  • Both save a hop + buffer at the HN-F. Big effect on latency-sensitive DB/KV workloads (typical L3-miss latency drops ~20 cycles; see the hop-count sketch after this list).
  • CMN-700 also adds DVM message filtering — TLB invalidates (DVM messages) are scoped by VMID / ASID, avoiding mesh-wide broadcasts.
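A back-of-envelope sketch of the DMT hop saving, reusing the Manhattan-distance routing model from earlier; the tile coordinates are hypothetical:

```c
/* Without DMT, DRAM data is relayed: SN-F -> HN-F -> RN-F.
 * With DMT, it goes SN-F -> RN-F directly. */
#include <stdio.h>
#include <stdlib.h>

typedef struct { int x, y; } tile;

static int hops(tile a, tile b) { return abs(a.x - b.x) + abs(a.y - b.y); }

int main(void)
{
    tile rnf = {0, 0}, hnf = {3, 2}, snf = {5, 0};
    int via_hnf = hops(snf, hnf) + hops(hnf, rnf);  /* data relayed at home  */
    int direct  = hops(snf, rnf);                   /* DMT: straight to core */
    printf("data path via HN-F: %d hops, with DMT: %d hops\n", via_hnf, direct);
    return 0;
}
```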

Why DMT / DCT matter

DDR access already costs 70-90 ns. Adding another 20+ cycles of mesh hops at the HN-F hurts P99 latency on Redis, Memcached, Cassandra. These optimizations are why CMN-700 beats earlier meshes by 2-3× on tail latency.

Cache stashing

CHI also supports "cache stashing" — I/O devices (NIC, NVMe) can hint data into a CPU's L2 ahead of the CPU reading it. Widely used for NIC → CPU packet processing.

07

CMN-600 — the First Neoverse Mesh (2018)

  • Launched in 2018. Up to 8×8 mesh = 64 nodes, 128 coherent RN-Fs.
  • CHI-A transport. Up to 1 TB/s aggregate bandwidth on a single mesh.
  • Supports cache stashing, MPAM v0, RAS framework.
  • Shipping in: AWS Graviton 2, Ampere Altra, Huawei Kunpeng 920.
  • Limitation: no CXL, and no coherent accelerator attach until CCIX arrived with CMN-650 — NICs and accelerators hung off plain, non-coherent PCIe.

CMN-650 (2020)

  • Mid-step refresh. CHI-B transport.
  • Added CCIX 1.1 support (coherent accelerator attach over PCIe).
  • First hooks for CXL 1.1.

Graviton 2 topology

64 N1 cores across a 4×8 subset of the mesh (32 tiles used). 2 × 16 MB SLC halves, 32 MB total. 8 × DDR4-3200 channels, 4 × PCIe Gen 4 root complexes. SoC power estimated at roughly 100-110 W (AWS publishes no official TDP).

Ampere Altra sidenote

Same CMN-600, but with 80 N1 cores at up to 3.3 GHz; Altra Max later scaled to 128 cores. Ampere was the first merchant-market N1 server part — Amazon's Graviton 2 was technically first but AWS-exclusive.

08

CMN-700 — the v9 Generation (2021)

  • Up to 12×12 mesh = 144 nodes, up to 256 RN-Fs.
  • CHI-E transport. Higher radix per switch, wider data paths (up to 512-bit).
  • CXL 2.0 support via CXS-to-CXL bridge; allows memory pooling.
  • MPAM v1 mandatory — QoS on bandwidth + cache partitioning.
  • DVM filtering — targeted TLB invalidates rather than broadcast.
  • Introduced slice hashing by locality as an option — reduces mesh traversal on sequential streams.
  • Shipping in: Microsoft Cobalt 100, Alibaba Yitian 710, Graviton 4, NVIDIA Grace.

CXL 2.0 as architected

CMN-700 presents HN-F home nodes for CXL.mem ranges — remote memory pool appears as coherent DRAM. Hyperscale use: disaggregated DRAM tiers, giving tens of TB per rack addressable from every socket.
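A sketch of how an HN-F might steer misses by address range; the window base and size are invented for illustration. Coherence handling is identical for both targets, only the miss latency differs.

```c
/* System address map where a CXL.mem pool is just another backing
 * target behind the HN-Fs. */
#include <stdint.h>
#include <stdio.h>

#define CXL_BASE  0x40000000000ull   /* hypothetical window base */
#define CXL_SIZE  0x10000000000ull   /* hypothetical 1 TB pool   */

typedef enum { TGT_DDR_SNF, TGT_CXL_SNF } target;

/* The HN-F picks the subordinate node by address range. */
static target backing_target(uint64_t pa)
{
    return (pa >= CXL_BASE && pa < CXL_BASE + CXL_SIZE)
           ? TGT_CXL_SNF : TGT_DDR_SNF;
}

int main(void)
{
    printf("PA 0x%llx -> %s\n", 0x1000ull,
           backing_target(0x1000ull) == TGT_CXL_SNF ? "CXL pool" : "local DDR");
    printf("PA 0x%llx -> %s\n", CXL_BASE,
           backing_target(CXL_BASE) == TGT_CXL_SNF ? "CXL pool" : "local DDR");
    return 0;
}
```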

NVLink-C2C piggybacks

NVIDIA Grace uses CMN-700 with a proprietary C2C bridge from CHI to NVLink — 900 GB/s coherent to Hopper/Blackwell GPU. Same semantics as CXL but much higher bandwidth.

09

CMN S3 — the 2024 Chiplet-Aware Mesh

  • Launched at Arm Neoverse Tech Day 2024. Paired with N3 / V3 + CSS.
  • Key shifts vs CMN-700:
    • CHI-E2 transport — wider, higher-frequency data paths
    • Unified on-die + chiplet topology — the mesh spans multiple chiplets over UCIe, presenting a single coherence domain
    • CXL 3.0 — type 1/2/3 devices, multi-host, switchable fabric
    • CCA / Realms — Granule Protection Checks happen at HN-F
    • Higher HN-F radix; better SLC access latency at large mesh sizes
  • Target silicon: AWS Graviton 5 (reported), Microsoft Cobalt 200, multi-chiplet HPC/AI servers.

Chiplet coherence

CMN S3 treats UCIe links like internal mesh links. A read from a core on chiplet A to a cache line owned by an RN-F on chiplet B looks identical to on-die traffic — same REQ/SNP/RSP/DAT flow, just traverses UCIe.

GPC at HN-F

Every read/write to physical memory goes through a Granule Protection Check against the GPT (Granule Protection Table). HN-F is the natural enforcement point — it's the last stop before DRAM.
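A simplified model of the check: each 4 KB granule is tagged with an owning world, and the HN-F rejects mismatched accesses. The flat array and enum here are illustrative only; the real GPT is a multi-level structure defined by Arm's RME architecture.

```c
/* Conceptual granule-protection check at the HN-F. */
#include <stdint.h>
#include <stdio.h>

typedef enum { GPI_ROOT, GPI_REALM, GPI_SECURE, GPI_NONSECURE, GPI_ANY } gpi;

/* Hypothetical flat GPT: one entry per 4 KB granule. */
static gpi gpt_lookup(const gpi *gpt, uint64_t pa) { return gpt[pa >> 12]; }

static int gpc_allows(const gpi *gpt, uint64_t pa, gpi world)
{
    gpi g = gpt_lookup(gpt, pa);
    return g == GPI_ANY || g == world;   /* else: granule protection fault */
}

int main(void)
{
    gpi gpt[4] = { GPI_NONSECURE, GPI_REALM, GPI_REALM, GPI_ANY };
    printf("NS access to realm granule: %s\n",
           gpc_allows(gpt, 0x1000, GPI_NONSECURE) ? "ok" : "blocked");
    printf("Realm access to realm granule: %s\n",
           gpc_allows(gpt, 0x1000, GPI_REALM) ? "ok" : "blocked");
    return 0;
}
```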

10

CMN Generations Side-by-Side

Gen · Year · Max mesh · CHI version · PCIe / CXL / CCIX · Shipping in
CMN-600 · 2018 · 8×8 (64 tiles) · CHI-A · PCIe 4.0, no CXL · Graviton 2, Altra, Kunpeng 920
CMN-650 · 2020 · 8×8 · CHI-B · CCIX 1.1, CXL 1.1 (hooks) · Older Altra derivatives
CMN-700 · 2021 · 12×12 (144) · CHI-E · CXL 2.0, PCIe 5.0 · Graviton 3/4, Cobalt 100, Grace
CMN S3 · 2024 · Multi-chiplet · CHI-E2 · CXL 3.0, PCIe 6.0, UCIe · Cobalt 200 (reported), Graviton 5

The trend: more tiles per mesh, higher-frequency links, and cleaner integration with CXL (for memory pooling) and UCIe (for chiplets). Every CHI generation is a strict superset of the prior one.

11

Memory Controllers & SN-F Nodes

  • Each memory channel has an SN-F node on the mesh.
  • SN-F nodes map to Arm memory-controller IP:
    • DMC-520 — LPDDR4 / DDR4, 32-bit
    • DMC-620 — DDR4, 64-bit, CMN-600 era
    • DMC-750 — DDR5 + RDIMM/LRDIMM, CMN-700 era
    • DMC-1000 (2024) — DDR5-6400, CXL 2.0, ECC-chipkill
  • Additional SN-F variants front HBM, CXL.mem (pooled memory), and Flash/NVDIMM. Grace is the outlier, with up to 480 GB of LPDDR5X behind its SN-Fs.
  • Memory scrubbing happens at the SN-F — patrol + on-demand scrubbers catch the soft errors that accumulate in large-DRAM server fleets.

Hash → channel placement

For a flat memory view, the address → HN-F → SN-F mapping must be channel-aware: consecutive cache lines should land on different SN-Fs so a sequential stream uses every channel's bandwidth. Software never sees this — it's pure hardware.
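A quick way to convince yourself: run sequential lines through a line-interleaved hash (the same XOR-fold assumption as the earlier sketch, with an illustrative channel count) and histogram the channels.

```c
/* Check that line-interleaved hashing spreads a sequential stream
 * uniformly across channels. */
#include <stdint.h>
#include <stdio.h>

#define NUM_CH 8u

static unsigned channel_of(uint64_t pa)
{
    uint64_t h = pa >> 6;          /* drop the 64 B line offset   */
    h ^= h >> 12; h ^= h >> 24;    /* XOR-fold high bits into low */
    return (unsigned)(h & (NUM_CH - 1));
}

int main(void)
{
    unsigned hist[NUM_CH] = {0};
    for (uint64_t pa = 0; pa < 1024 * 64; pa += 64)   /* 1024 sequential lines */
        hist[channel_of(pa)]++;
    for (unsigned c = 0; c < NUM_CH; c++)
        printf("channel %u: %u lines\n", c, hist[c]); /* 128 each: uniform */
    return 0;
}
```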

8-12 channels standard

Neoverse servers use 8-12 DDR5 channels per socket. Aggregate bandwidth: 300-600 GB/s per socket, enough to keep 128 cores memory-fed for most cloud workloads.
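As a sanity check on those numbers: 12 channels of DDR5-5600 peak at 12 × 5.6 GT/s × 8 B per transfer ≈ 538 GB/s, squarely in that range.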

12

I/O Sub-system — RN-I, SMMU, PCIe

  • An RN-I (I/O Request Node) sits on the mesh for every I/O domain. It bridges non-coherent AXI/CXL/PCIe traffic into coherent CHI.
  • Each RN-I is preceded by an SMMU-v3 that:
    • Translates I/O virtual addresses → PA (stage 1)
    • Enforces VM isolation (stage 2)
    • Resolves each device's STE / CD (SMMU Stream Table Entry, Context Descriptor) to find its translation context
    • Supports PCIe ATS + PRI (Address Translation Service, Page Request Interface)
  • PCIe Gen 5 / 6 controllers typically integrated on the mesh edge with dedicated RN-I nodes.
  • CXL devices can present as RN-F (type-1 accelerator, with its own cache) or SN-F (type-3 memory expander).
[Figure: I/O path on CMN — PCIe 5/6 controller → SMMU-700 (stage 1 + 2, ATS, PRI) → RN-I on the CMN mesh → coherent CHI to HN-F / SLC]
13

Performance Features of CMN

  • SLC Cache Stashing — NIC/NVMe can stash packet data into the LLC before the CPU reads it.
  • QoS classes — 16 QoS levels per REQ; MPAM-mapped.
  • Dynamic snoop filter expansion — HN-Fs can switch between inclusive / partial-inclusive snoop filtering based on workload pressure.
  • Ordering domains — REQ ordering can be relaxed per address space for bandwidth (used for accelerator + GPU traffic).
  • Flow-control credits per VC — prevents head-of-line blocking between REQ/SNP/DAT.
  • Error containment — RAS framework routes poison to the node best placed to handle it (usually HN-F or memory controller).

Arm PMU on the mesh

Each CMN node exposes its own PMU. Total ~500 events across a large mesh. Linux tools (arm_cmn driver + perf) expose them as arm_cmn_0/hnf_cache_miss/ etc. Essential for diagnosing LLC / memory-bandwidth issues at socket level.
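The driver follows the standard perf PMU sysfs layout, so the available event aliases can be listed straight from sysfs. A small sketch (the path exists only on CMN hardware with the arm_cmn driver loaded):

```c
/* Enumerate the event aliases the arm_cmn driver exports via sysfs. */
#include <dirent.h>
#include <stdio.h>

int main(void)
{
    const char *dir = "/sys/bus/event_source/devices/arm_cmn_0/events";
    DIR *d = opendir(dir);
    if (!d) { perror(dir); return 1; }
    struct dirent *e;
    while ((e = readdir(d)) != NULL)
        if (e->d_name[0] != '.')
            printf("arm_cmn_0/%s/\n", e->d_name);  /* usable as a perf event */
    closedir(d);
    return 0;
}
```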

QoS in practice

Cloud tenants tagged "gold" get QoS 0-3 (highest priority); best-effort jobs get QoS 12-15. At high mesh pressure, low-QoS traffic queues; high-QoS cuts through.
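A sketch of what such a policy table might look like in fabric-management firmware. The class names and band values mirror the text above but are illustrative, not an Arm API.

```c
/* Tenant class -> CHI QoS value mapping; this sketch follows the
 * slide's convention that the 0-3 band is highest priority. */
#include <stdio.h>

typedef enum { TENANT_GOLD, TENANT_SILVER, TENANT_BEST_EFFORT } tenant_class;

/* CHI carries a 4-bit QoS field per request. */
static unsigned qos_for(tenant_class c)
{
    switch (c) {
    case TENANT_GOLD:   return 0;    /* QoS 0-3 band: cuts through  */
    case TENANT_SILVER: return 6;    /* middle band                 */
    default:            return 12;   /* QoS 12-15: queues under load */
    }
}

int main(void)
{
    printf("gold -> QoS %u, best-effort -> QoS %u\n",
           qos_for(TENANT_GOLD), qos_for(TENANT_BEST_EFFORT));
    return 0;
}
```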

14

CMN vs Intel Mesh vs AMD Infinity Fabric

Property · Arm CMN-700 / S3 · Intel Mesh (Sapphire / Granite) · AMD Infinity Fabric 4
Topology · 2-D mesh of tiles · 2-D mesh of tiles · Hub-and-spoke + IFOP links across chiplets
Coherence · AMBA 5 CHI, directory at HN-F · MESIF, distributed LLC slices · MOESI with probe filter at the IOD
LLC · Distributed SLC (128-512 MB) · Distributed LLC (60-120 MB) · Per-CCD L3, optionally 3D V-Cache (96-192 MB)
Snoop filter · Tag directory at HN-F · Full directory at mesh nodes · Probe filter on the I/O die
Cross-chiplet · UCIe (CMN S3) / custom C2C · EMIB + chip-to-chip interposers · Infinity Fabric serial links
External coherence · CCIX, CXL 2/3 · CXL 1.1 / 2.0 · CXL 1.1 / 2.0

All three converged on mesh + directory coherence for 64+ cores. Biggest differentiator for Arm: standardised CHI protocol, which means IP from Arm, Synopsys, and customers interoperate at RTL level.

15

Lessons

  • "Why is HN-F the key node?" → owns the snoop-filter directory + a slice of SLC. Every coherent read/write to its address range goes there. Scales to >128 cores because snoops are targeted, not broadcast.
  • "Explain DMT vs DCT" → DMT = HN-F tells SN-F to forward DRAM data directly to requester (skip HN-F buffer). DCT = HN-F tells another RN-F to forward cached data directly. Both cut L3-miss latency by ~20 cycles.
  • "How does CXL.mem integrate?" → external memory expanders appear as SN-F nodes. HN-Fs own the CXL address ranges and enforce coherence. Software sees a normal RAM region with slightly higher latency.
  • "What's MPAM?" → partition ID (PartID) tagged on every transaction; HN-F enforces cache-way + bandwidth limits per PartID. Used for cloud multi-tenancy QoS.
  • "Why CHI and not AXI?" → AXI-ACE topped out at ~16 coherent nodes because snoops were broadcast. CHI is a packet-switched protocol that scales with the mesh.
16

References

Arm Ltd. · CMN-600 / 650 / 700 Technical Reference Manuals: node types, mesh configuration, CHI programmer's view
Arm Ltd. · AMBA 5 CHI Architecture Specification (Issue E / E2): the complete CHI protocol
Arm Ltd. · MPAM System Architecture Specification (DDI 0598)
Arm Ltd. · Neoverse CSS N2 / V2 / V3 reference designs (2023-24)
AWS · Graviton 2/3/4 whitepapers: CMN topology and SLC sizing
Hot Chips · annual Neoverse interconnect presentations
Chips and Cheese · measurement-based CMN latency and bandwidth deep dives
Linux kernel · drivers/perf/arm-cmn.c: CMN PMU driver with the full node / event catalogue

Presentation built with Reveal.js 4.6 · Playfair Display + DM Sans + JetBrains Mono
Educational use.