Owns a slice of the SLC; snoop filter resides here; directory for the lines it hosts
HN-I
I/O Home Node
Home for memory-mapped I/O addresses
SN-F
Fully-coherent Slave Node
Memory controller — backs HN-F misses to DRAM
MN
Miscellaneous Node
Barriers, events, resets, debug; system coordination services
CML
Coherent Multichip Link (CXS interface)
Bridges the mesh to another mesh (chip-to-chip over CCIX, CXL, or UCIe)
The HN-F is the star of the show — it holds the snoop filter (directory) for its address range plus a slice of the system-level cache (SLC), the shared last-level cache. Every read to a coherent address goes to the HN-F that owns that address.
04
SLC Slicing — Distributed Last-Level Cache
The SLC (System Level Cache) is distributed across HN-F nodes. Each HN-F holds a slice.
Address hash picks the HN-F: usually hash(PA[47:6]) → HN-F index. Sometimes the hash is interleaved by cache line to spread bandwidth; sometimes by page to improve locality.
Typical SLC slice: 2-8 MB per HN-F. A 32-HN-F mesh with 8 MB slices = 256 MB SLC.
Neoverse V2 Grace: 117 MB SLC across the CMN-700 mesh.
Graviton 3: ~64 MB. Graviton 4: 192 MB.
Slice selection is normally fixed in hardware; MPAM partitioning adjusts how much of each slice a tenant can use.
Hash-based interleaving
Hashing by cache-line index (an XOR of high and low address bits) gives uniform load across HN-Fs. The cost: a sequential memory stream hits a different HN-F on nearly every line — more mesh traversal. Hash-by-page trades load balance for locality.
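A minimal sketch of the two interleaving choices, in Python. The slice count, fold width, and page size are assumptions for illustration; the real CMN hash is an undocumented XOR tree fixed at design time.

```python
# Illustrative HN-F slice selection (assumed constants, not the real CMN hash).
NUM_HNF = 32          # HN-F slices on the mesh (assumed power of two)
LINE_BITS = 6         # 64-byte cache line
PAGE_BITS = 12        # 4 KiB page

def hnf_by_line(pa: int) -> int:
    """Line-interleaved: XOR-fold PA[47:6] so consecutive lines spread
    across HN-Fs (uniform load, more mesh traversal)."""
    idx = pa >> LINE_BITS
    h = 0
    chunk = (NUM_HNF - 1).bit_length()   # 5-bit chunks for 32 slices
    while idx:
        h ^= idx & (NUM_HNF - 1)
        idx >>= chunk
    return h

def hnf_by_page(pa: int) -> int:
    """Page-interleaved: every line of a 4 KiB page lands on one HN-F
    (better locality, lumpier load)."""
    return (pa >> PAGE_BITS) % NUM_HNF

base = 0x4000_0000
lines = [base + i * 64 for i in range(8)]
print([hnf_by_line(a) for a in lines])   # eight different slices
print([hnf_by_page(a) for a in lines])   # all the same slice
```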
The snoop filter lives here too
Each HN-F keeps a directory of which RN-Fs cache the lines it owns. Snoops are targeted, not broadcast — this is what makes 128+ core scaling possible.
Each CHI channel (REQ, SNP, RSP, DAT) is independent — backpressure on one doesn't block the others. Flow control is credit-based: the requester must hold a credit ("slot") on the target channel before issuing.
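A toy model of that property, per-channel credits in Python; the real CHI link-layer credit scheme is more involved, so treat this purely as an illustration that a stalled channel does not block its neighbours.

```python
# Per-channel credit flow control (illustrative sketch, not the CHI link layer).
from collections import defaultdict

class ChannelCredits:
    def __init__(self, credits_per_channel: int = 4):
        self.credits = defaultdict(lambda: credits_per_channel)

    def try_issue(self, channel: str) -> bool:
        if self.credits[channel] == 0:
            return False              # back-pressured on this channel only
        self.credits[channel] -= 1
        return True

    def release(self, channel: str) -> None:
        self.credits[channel] += 1    # receiver hands the credit back

link = ChannelCredits(credits_per_channel=2)
assert link.try_issue("REQ") and link.try_issue("REQ")
assert not link.try_issue("REQ")      # REQ is out of credits...
assert link.try_issue("DAT")          # ...but DAT still flows
```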
A canonical read
1) The RN-F issues a ReadShared on REQ to the HN-F that homes the address.
2) The HN-F consults its snoop filter: if another RN-F holds the line Unique, the HN-F sends SnpShared on SNP to that RN-F.
3) The snooped RN-F returns the line to the HN-F as a snoop response with data on DAT.
4) The HN-F forwards CompData on DAT to the original requester, which acknowledges completion on RSP.
5) The requester's cache transitions I → SC.
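The same flow as a toy message trace in Python. The opcode names approximate real CHI opcodes, but the exact response choreography varies by transaction type, so read this as an illustration of ordering rather than a protocol-accurate model.

```python
# Toy trace of the canonical ReadShared flow above (illustrative, not CHI-accurate).
from typing import NamedTuple

class Msg(NamedTuple):
    channel: str
    src: str
    dst: str
    op: str

def read_shared(requester: str, owner: str, hnf: str) -> list[Msg]:
    return [
        Msg("REQ", requester, hnf,       "ReadShared"),
        Msg("SNP", hnf,       owner,     "SnpShared"),     # only if the snoop filter hits
        Msg("DAT", owner,     hnf,       "SnpRespData"),   # snoop response carrying the line
        Msg("DAT", hnf,       requester, "CompData"),      # data + completion combined
        Msg("RSP", requester, hnf,       "CompAck"),       # requester closes the transaction
    ]

for m in read_shared("RN-F0", "RN-F7", "HN-F3"):
    print(f"{m.channel}: {m.src} -> {m.dst}  {m.op}")
# Requester cache state: I -> SC.
```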
06
DMT & DCT — Direct Transfer Optimizations
DMT — Direct Memory Transfer. When HN-F sees that its SLC doesn't have the line and it's going to DRAM, it tells the SN-F (memory controller) to send the data directly to the requester, bypassing the HN-F.
DCT — Direct Cache Transfer. When another RN-F holds the line, HN-F arranges for that RN-F to send data directly to the requester, not through HN-F.
Both save a hop + buffer at the HN-F. Big effect on latency-sensitive DB/KV workloads (typical L3-miss latency drops ~20 cycles).
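A back-of-envelope sketch of what DMT removes from the critical path, using Manhattan-distance hops between assumed tile coordinates. The hop latency, tile placements, and HN-F buffering figure are illustrative numbers only; DCT is the analogous shortcut for cache-to-cache transfers.

```python
# Rough SLC-miss latency with and without DMT (all constants are assumptions).
MESH_HOP_NS = 1.5
DRAM_NS = 80.0         # within the 70-90 ns DDR figure quoted below
HNF_BUFFER_NS = 8.0    # store-and-forward through the HN-F on the return path

def hops(a, b):
    """Manhattan distance between (x, y) mesh coordinates."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

rnf, hnf, snf = (0, 0), (3, 2), (5, 0)   # assumed tile placements

req_path = (hops(rnf, hnf) + hops(hnf, snf)) * MESH_HOP_NS        # REQ always goes via the HN-F

baseline = req_path + DRAM_NS + (hops(snf, hnf) + hops(hnf, rnf)) * MESH_HOP_NS + HNF_BUFFER_NS
with_dmt = req_path + DRAM_NS + hops(snf, rnf) * MESH_HOP_NS      # SN-F sends CompData straight back

print(f"baseline SLC miss: {baseline:.1f} ns, with DMT: {with_dmt:.1f} ns")
```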
CMN-700 adds DVM snoop filtering — TLB invalidates (DVM messages) are filtered by VMID / ASID scope, avoiding mesh-wide broadcasts.
Why DMT / DCT matter
DDR access is already 70-90 ns. Adding another 20+ cycles of mesh hops at the HN-F matters for P99 latency on Redis, Memcached, Cassandra. These optimisations are why CMN-700 beats earlier meshes by 2-3× on tail latency.
Cache stashing
CHI also supports "cache stashing" — I/O devices (NIC, NVMe) can hint data into a CPU's L2 ahead of the CPU reading it. Widely used for NIC → CPU packet processing.
07
CMN-600 — the First Neoverse Mesh (2018)
Launched in 2018. Up to 8×8 mesh = 64 nodes, 128 coherent RN-Fs.
CHI-A transport. Up to 1 TB/s aggregate bandwidth on a single mesh.
Limitation: PCIe topped out at Gen 4 and there was no CXL — NICs and accelerators attached through external, non-coherent PCIe cards.
CMN-650 (2020)
Mid-step refresh. CHI-B transport.
Added CCIX 1.1 support (coherent accelerator attach over PCIe).
First hooks for CXL 1.1.
Graviton 2 topology
64 N1 cores across a 4×8 subset of the mesh (32 tiles used). 2 × 16 MB SLC per half, total 32 MB. 8 × DDR4-3200 channels, 4 × PCIe Gen 4 roots. Roughly 100 W (estimated) with all 64 cores active.
Ampere Altra sidenote
Same CMN-600, but with 80 N1 cores at up to 3.3 GHz (Altra Max later pushed to 128 cores). Ampere was the first N1 server sold on the open market — Amazon's Graviton 2 shipped earlier but was exclusive to AWS.
08
CMN-700 — the v9 Generation (2021)
Up to 12×12 mesh = 144 nodes, up to 256 RN-Fs.
CHI-E transport. Higher radix per switch, wider data paths (up to 512-bit).
CXL 2.0 support via CXS-to-CXL bridge; allows memory pooling.
MPAM v1 mandatory — QoS on bandwidth + cache partitioning.
DVM filtering — targeted TLB invalidates rather than broadcast.
Introduced slice hashing by locality as an option — reduces mesh traversal on sequential streams.
CMN-700 presents HN-F home nodes for CXL.mem ranges — remote memory pool appears as coherent DRAM. Hyperscale use: disaggregated DRAM tiers, giving tens of TB per rack addressable from every socket.
NVLink-C2C piggybacks
NVIDIA Grace uses CMN-700 with a proprietary C2C bridge from CHI to NVLink — 900 GB/s coherent to Hopper/Blackwell GPU. Same semantics as CXL but much higher bandwidth.
09
CMN S3 — the 2024 Chiplet-Aware Mesh
Launched at Arm Neoverse Tech Day 2024. Paired with N3 / V3 + CSS.
Key shifts vs CMN-700:
CHI-E2 transport — wider, higher-frequency data paths
Unified on-die + chiplet topology — the mesh spans multiple chiplets over UCIe, presenting a single coherence domain
CXL 3.0 — type 1/2/3 devices, multi-host, switchable fabric
CCA / Realms — Granule Protection Checks happen at HN-F
Higher HN-F radix; better SLC access latency at large mesh sizes
CMN S3 treats UCIe links like internal mesh links. A read from a core on chiplet A to a line homed (or cached) on chiplet B looks identical to on-die traffic — same REQ/SNP/RSP/DAT flow, it just traverses UCIe.
GPC at HN-F
Every read/write to physical memory goes through a Granule Protection Check against the GPT (Granule Protection Table). HN-F is the natural enforcement point — it's the last stop before DRAM.
10
CMN Generations Side-by-Side
Gen     | Year | Max mesh       | CHI version | PCIe / CXL / CCIX         | Shipping in
CMN-600 | 2018 | 8×8 (64 tiles) | CHI-A       | PCIe 4.0, no CXL          | Graviton 2, Altra, Kunpeng 920
CMN-650 | 2020 | 8×8            | CHI-B       | CCIX 1.1, CXL 1.1 (hooks) | Older Altra derivatives
CMN-700 | 2021 | 12×12 (144)    | CHI-E       | CXL 2.0, PCIe 5.0         | Graviton 3/4, Cobalt 100, Grace
CMN S3  | 2024 | Multi-chiplet  | CHI-E2      | CXL 3.0, PCIe 6.0, UCIe   | Cobalt 200 (reported), Graviton 5
The trend: more tiles per mesh, higher-frequency links, and cleaner integration with CXL (for memory pooling) and UCIe (for chiplets). Every CHI generation is a strict superset of the prior one.
Besides plain DDR5, SN-F variants front HBM, CXL.mem (pooled memory), and Flash/NVDIMM; Grace is the exception, pairing the mesh with 480 GB of LPDDR5X instead of HBM.
Memory scrubbing happens at the SN-F — patrol and on-demand scrubbers catch the soft errors expected in large-DRAM server deployments.
Hash → channel placement
For a flat memory view, the address → HN-F → SN-F mapping must be channel-aware: sequential cache-line accesses should spread across all SN-Fs (and therefore all DRAM channels) rather than piling onto one, to maximise bandwidth. Software doesn't see any of this — it's pure hardware.
8-12 channels standard
Neoverse servers use 8-12 DDR5 channels per socket. Aggregate bandwidth: 300-600 GB/s per socket, enough to keep 128 cores memory-fed for most cloud workloads.
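The arithmetic behind that range is simple; a quick sketch assuming DDR5-5600 on 64-bit channels (speed grade and channel count are assumptions):

```python
# Peak theoretical DRAM bandwidth per socket (assumed DDR5 speed grade).
CHANNELS = 12
MT_PER_S = 5600e6          # DDR5-5600: transfers per second
BYTES_PER_TRANSFER = 8     # 64-bit channel

per_channel = MT_PER_S * BYTES_PER_TRANSFER / 1e9      # GB/s
print(f"{per_channel:.1f} GB/s per channel, {CHANNELS * per_channel:.0f} GB/s per socket")
# DDR5-5600 x 8 B = 44.8 GB/s per channel; 12 channels ≈ 538 GB/s peak.
# 8 channels of DDR5-4800 land near 307 GB/s, the bottom of the quoted range.
```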
12
I/O Sub-system — RN-I, SMMU, PCIe
An RN-I (I/O Request Node) sits on the mesh for every I/O domain. It bridges non-coherent AXI / PCIe / CXL.io traffic into coherent CHI, with device DMA translated by the SMMU on the way in.
PCIe Gen 5 / 6 controllers are typically integrated at the mesh edge with dedicated RN-I nodes.
CXL devices can present as RN-F (type-1 accelerator, with its own cache) or SN-F (type-3 memory expander).
13
Performance Features of CMN
SLC Cache Stashing — NIC/NVMe can stash packet data into the LLC before the CPU reads it.
QoS classes — 16 QoS levels per REQ; MPAM-mapped.
Dynamic snoop filter expansion — HN-Fs can switch between inclusive / partial-inclusive snoop filtering based on workload pressure.
Ordering domains — REQ ordering can be relaxed per address space for bandwidth (used for accelerator + GPU traffic).
Flow-control credits per VC — prevents head-of-line blocking between REQ/SNP/DAT.
Error containment — RAS framework routes poison to the node best placed to handle it (usually HN-F or memory controller).
Arm PMU on the mesh
Each CMN node exposes its own PMU. Total ~500 events across a large mesh. Linux tools (arm_cmn driver + perf) expose them as arm_cmn_0/hnf_cache_miss/ etc. Essential for diagnosing LLC / memory-bandwidth issues at socket level.
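A small sketch that lists those events through the standard perf sysfs ABI and builds a perf stat command line. The arm_cmn_0 PMU name and hnf_cache_miss event come from the driver as quoted above; the second event name is an assumption, so check the events directory on the target machine.

```python
# List arm_cmn PMU events via the generic perf sysfs layout, then show a
# perf stat invocation for two HN-F counters (second event name assumed).
from pathlib import Path

pmu = Path("/sys/bus/event_source/devices/arm_cmn_0")
events_dir = pmu / "events"

if events_dir.is_dir():
    events = sorted(p.name for p in events_dir.iterdir())
    print(f"{len(events)} events, e.g. {[e for e in events if e.startswith('hnf_')][:5]}")

    # System-wide HN-F SLC misses vs accesses for 10 seconds:
    cmd = ("perf stat -a "
           "-e arm_cmn_0/hnf_cache_miss/ "
           "-e arm_cmn_0/hnf_slc_sf_cache_access/ "
           "sleep 10")
    print(cmd)
else:
    print("arm_cmn PMU not present (not a CMN-based system, or driver not loaded)")
```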
QoS in practice
Cloud tenants tagged "gold" get QoS 12-15 (highest priority); best-effort jobs get QoS 0-3. At high mesh pressure, low-QoS traffic queues while high-QoS traffic cuts through.
14
CMN vs Intel Mesh vs AMD Infinity Fabric
Property      | Arm CMN-700 / S3              | Intel Mesh (Sapphire/Granite)   | AMD Infinity Fabric 4
Topology      | 2D mesh of tiles              | 2D mesh of tiles                | Hub-and-spoke + IFOP across chiplets
Coherence     | AMBA 5 CHI, directory at HN-F | MESIF, distributed LLC slices   | MOESI with probe filter at IOD
LLC           | Distributed SLC (128-512 MB)  | Distributed LLC (60-120 MB)     | Per-CCD L3, optionally with stacked V-Cache (96-192 MB)
Snoop filter  | Tag directory at HN-F         | Full directory at mesh nodes    | Probe filter on I/O die
Cross-chiplet | UCIe (CMN S3) / custom C2C    | EMIB + chip-to-chip interposers | Infinity Fabric serial links
External coh. | CCIX, CXL 2/3                 | CXL 1.1 / 2.0                   | CXL 1.1 / 2.0
All three converged on directory/filter-based coherence to scale past 64 cores. The biggest differentiator for Arm: the standardised CHI protocol, which means IP from Arm, Synopsys, and customers interoperates at the RTL level.
15
Lessons
"Why is HN-F the key node?" → owns the snoop-filter directory + a slice of SLC. Every coherent read/write to its address range goes there. Scales to >128 cores because snoops are targeted, not broadcast.
"Explain DMT vs DCT" → DMT = HN-F tells SN-F to forward DRAM data directly to requester (skip HN-F buffer). DCT = HN-F tells another RN-F to forward cached data directly. Both cut L3-miss latency by ~20 cycles.
"How does CXL.mem integrate?" → external memory expanders appear as SN-F nodes. HN-Fs own the CXL address ranges and enforce coherence. Software sees a normal RAM region with slightly higher latency.
"What's MPAM?" → partition ID (PartID) tagged on every transaction; HN-F enforces cache-way + bandwidth limits per PartID. Used for cloud multi-tenancy QoS.
"Why CHI and not AXI?" → AXI-ACE topped out at ~16 coherent nodes because snoops were broadcast. CHI is a packet-switched protocol that scales with the mesh.
16
References
Arm Ltd. — CMN-600 / 650 / 700 Technical Reference Manuals — node types, mesh, CHI programmer's view
Arm Ltd. — AMBA 5 CHI Architecture Specification (Issue E / E2) — the complete CHI protocol
Arm Ltd. — MPAM System Architecture (DDI 0598)
Arm Ltd. — Neoverse CSS N2 / V2 / V3 reference designs (2023-24)
AWS — Graviton 2/3/4 whitepapers with CMN topology + SLC sizing
Hot Chips — annual Neoverse interconnect papers
Chipsandcheese.com — measurement-based CMN latency + bandwidth deep-dives
Linux kernel — drivers/perf/arm-cmn.c — CMN PMU driver with full node / event catalogue
Presentation built with Reveal.js 4.6 · Playfair Display + DM Sans + JetBrains Mono
Educational use.