ARM NEOVERSE · PRESENTATION 03

CMN Mesh Interconnect

CMN-600 · 650 · 700 · S3 · the fabric that makes 192-core sockets possible
RN-F · HN-F · SN-F · RN-I · MN · SLC slicing · Snoop filter · CHI-B/C/E · CCIX · CXL · UCIe
02

Why a Mesh, Not a Bus

  • For up to ~16 cores, a single shared coherent bus (AMBA 4 ACE) works.
  • Beyond that, the bus becomes the bottleneck: every snoop broadcasts to every core.
  • Arm's answer (2017): a 2-D mesh with CHI (Coherent Hub Interface) transport. Each tile routes packets independently (see the routing sketch below); no shared bus.
  • Snoops become directory-filtered — only cores known to hold a line are snooped.
  • The mesh is what makes 128-192 cores per socket feasible; AMBA 4 ACE topped out around 16 coherent cores.
[Figure: CMN mesh, 4×4 tiles — R = RN-F (core), H = HN-F (home/SLC), S = SN-F (DRAM), I = RN-I (I/O)]
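A minimal sketch of how packets cross such a mesh, assuming simple dimension-ordered (X-then-Y) routing. The coordinates and names are illustrative, not Arm's RTL: the point is that latency grows with Manhattan distance instead of every agent contending for one shared medium.

```c
/* Dimension-ordered (X-then-Y) routing on a 2-D mesh: a packet first
 * travels along X to the target column, then along Y to the target row. */
#include <stdio.h>
#include <stdlib.h>

typedef struct { int x, y; } tile;

/* Hop count from src to dst under X-then-Y routing. */
static int mesh_hops(tile src, tile dst)
{
    return abs(dst.x - src.x) + abs(dst.y - src.y);
}

int main(void)
{
    tile rnf = { 0, 0 };   /* requesting core tile      */
    tile hnf = { 3, 2 };   /* home node for the address */
    printf("RN-F(0,0) -> HN-F(3,2): %d hops\n", mesh_hops(rnf, hnf)); /* 5 */
    return 0;
}
```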
03

Node Types in a CMN Mesh

Node · Type · Role
RN-F · Fully-coherent Request Node · CPU cluster (Neoverse cores). Has a cache; issues coherent REQs.
RN-D · Request Node with DVM · IO-coherent requester (like RN-I) that additionally accepts DVM messages (TLB-invalidate broadcasts).
RN-I · I/O Request Node · I/O bridge — PCIe, SMMU, network, accelerator; non-caching requester.
HN-F · Fully-coherent Home Node · Owns a slice of the SLC; snoop filter resides here; directory for the lines it hosts.
HN-I · I/O Home Node · Home for memory-mapped I/O addresses.
SN-F · Fully-coherent Slave Node · Memory controller — backs HN-F misses to DRAM.
MN · Miscellaneous Node · Barriers, events, resets, debug; system coordination services.
CML · CXS / Chiplet Link · Bridges the mesh to another mesh (chiplet-to-chiplet over UCIe or CCIX).

The HN-F is the star of the show — it holds the snoop filter (directory) for its address range + a slice of the shared last-level cache (SLC). Every read to a coherent address goes to the HN-F for that address.

04

SLC Slicing — Distributed Last-Level Cache

  • The SLC (System Level Cache) is distributed across HN-F nodes. Each HN-F holds a slice.
  • Address hash picks the HN-F: usually hash(PA[47:6]) → HN-F index. Sometimes the hash is interleaved by cache line to spread bandwidth; sometimes by page to improve locality.
  • Typical SLC slice: 2-8 MB per HN-F. A 32-HN-F mesh with 8 MB slices = 256 MB SLC.
  • NVIDIA Grace (Neoverse V2): 117 MB SLC across the CMN-700 mesh.
  • Graviton 3: ~64 MB. Graviton 4: 192 MB.
  • Slice selection is normally fixed in hardware; MPAM partitioning adjusts how much of each slice a tenant can use.

Hash-based interleaving

Hashing by cache-line index (XOR of high + low address bits) gives uniform load across HN-Fs. The cost: a sequential memory stream touches many HN-Fs on every line — more mesh traversal. Hash-by-page trades load balance for locality.
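To make the trade-off concrete, here is a toy version of both policies. The XOR-fold hash and the HN-F count are assumptions for illustration; real CMN hash polynomials are configuration-specific and not public.

```c
/* Sketch of the two slice-selection policies, assuming a
 * power-of-two HN-F count. */
#include <stdint.h>
#include <stdio.h>

#define NUM_HNF 32u            /* must be a power of two here */

/* Line-interleaved: hash PA[47:6], so consecutive 64 B lines spread
 * across HN-Fs (uniform load, more mesh traversal per stream). */
static unsigned hnf_by_line(uint64_t pa)
{
    uint64_t h = pa >> 6;          /* drop the 64 B line offset   */
    h ^= h >> 12; h ^= h >> 24;    /* XOR-fold high bits into low */
    return (unsigned)(h & (NUM_HNF - 1));
}

/* Page-interleaved: hash PA[47:12], so a whole 4 KB page lives in
 * one slice (better locality, lumpier load). */
static unsigned hnf_by_page(uint64_t pa)
{
    uint64_t h = pa >> 12;
    h ^= h >> 12; h ^= h >> 24;
    return (unsigned)(h & (NUM_HNF - 1));
}

int main(void)
{
    for (uint64_t pa = 0; pa < 4 * 64; pa += 64)
        printf("PA=%#06llx  line-hash=HN-F %2u  page-hash=HN-F %2u\n",
               (unsigned long long)pa, hnf_by_line(pa), hnf_by_page(pa));
    return 0;
}
```

Running it shows consecutive 64 B lines fanning out across HN-Fs under the line hash, while the page hash keeps the whole 4 KB page on one slice.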

The snoop filter lives here too

Each HN-F keeps a directory of which RN-Fs cache the lines it owns. Snoops are targeted, not broadcast — this is what makes 128+ core scaling possible.
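A conceptual model of one directory entry: a tag plus a presence bit-vector, one bit per RN-F. The layout is hypothetical, not Arm's format; it only illustrates why snoop fan-out tracks actual sharers rather than core count.

```c
/* Toy HN-F snoop-filter entry: on a conflicting request, the HN-F
 * snoops only the RN-Fs whose presence bit is set. */
#include <stdint.h>
#include <stdio.h>

#define MAX_RNF 128

typedef struct {
    uint64_t tag;                      /* line address >> 6             */
    uint64_t sharers[MAX_RNF / 64];    /* which RN-Fs may hold the line */
} sf_entry;

/* Send targeted snoops instead of a broadcast; returns count sent. */
static int snoop_targets(const sf_entry *e)
{
    int sent = 0;
    for (int rn = 0; rn < MAX_RNF; rn++)
        if (e->sharers[rn / 64] >> (rn % 64) & 1) {
            printf("  SNP -> RN-F %d\n", rn);
            sent++;
        }
    return sent;
}

int main(void)
{
    sf_entry e = { .tag = 0x1234, .sharers = { 1ull << 3 | 1ull << 17, 0 } };
    printf("snooped %d of %d RN-Fs\n", snoop_targets(&e), MAX_RNF);
    return 0;
}
```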

05

The Four CHI Message Channels

Channel · Direction · Purpose
REQ (Request) · RN → HN, HN → SN · Read / Write / Atomic requests
RSP (Response) · Any → Any · Completions, acks, retries, snoop responses without data
SNP (Snoop) · HN → RN-F · Coherence snoops — SnpShared / SnpUnique / SnpCleanInvalid
DAT (Data) · Any → Any · Cache-line data payload (64 B)

Each channel is independent — backpressure on one doesn't block the others. Credit-based flow control: requester gets a "slot" before issuing.
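A toy model of that credit scheme, with assumed buffer depths, showing why exhausting one channel's credits stalls only that channel:

```c
/* Per-channel link credits: a sender must hold a credit before putting
 * a flit on the channel; the receiver returns the credit when it
 * drains its buffer. Each channel keeps its own pool, so a stalled
 * DAT buffer cannot starve SNP. */
#include <stdbool.h>
#include <stdio.h>

typedef struct { int credits; } channel;

static bool try_send(channel *ch, const char *what)
{
    if (ch->credits == 0) {
        printf("%s: no credit, sender stalls (this channel only)\n", what);
        return false;
    }
    ch->credits--;            /* a credit is consumed with the flit */
    return true;
}

static void credit_return(channel *ch) { ch->credits++; }  /* receiver drained */

int main(void)
{
    channel dat = { .credits = 1 }, snp = { .credits = 4 };
    try_send(&dat, "DAT");    /* ok, uses the last DAT credit     */
    try_send(&dat, "DAT");    /* blocks, but SNP is unaffected    */
    try_send(&snp, "SNP");    /* still flows                      */
    credit_return(&dat);      /* receiver frees a slot            */
    try_send(&dat, "DAT");    /* ok again                         */
    return 0;
}
```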

A canonical read

1. RN-F issues ReadShared on REQ to the HN-F that owns the address.
2. HN-F consults its snoop filter: if another RN-F holds the line Unique, it sends SnpShared on SNP to that RN-F.
3. The snooped RN-F returns the line on DAT and acknowledges on RSP.
4. HN-F forwards the data on DAT to the original requester, plus a completion on RSP.
5. The requester's cache line transitions I → SC.

06

DMT & DCT — Direct Transfer Optimizations

  • DMT — Direct Memory Transfer. When HN-F sees that its SLC doesn't have the line and it's going to DRAM, it tells the SN-F (memory controller) to send the data directly to the requester, bypassing the HN-F.
  • DCT — Direct Cache Transfer. When another RN-F holds the line, HN-F arranges for that RN-F to send data directly to the requester, not through HN-F.
  • Both save a hop + buffer at the HN-F. Big effect on latency-sensitive DB/KV workloads (typical L3-miss latency drops ~20 cycles; see the hop-count sketch after this list).
  • CMN-700 also adds DVM message filtering — TLB invalidates (DVM messages) are scoped by VMID / ASID, avoiding mesh-wide broadcasts.
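A back-of-envelope sketch of the DMT hop saving, reusing the Manhattan-distance routing model from earlier; the tile coordinates are hypothetical:

```c
/* Without DMT, DRAM data is relayed: SN-F -> HN-F -> RN-F.
 * With DMT, it goes SN-F -> RN-F directly. */
#include <stdio.h>
#include <stdlib.h>

typedef struct { int x, y; } tile;

static int hops(tile a, tile b) { return abs(a.x - b.x) + abs(a.y - b.y); }

int main(void)
{
    tile rnf = {0, 0}, hnf = {3, 2}, snf = {5, 0};
    int via_hnf = hops(snf, hnf) + hops(hnf, rnf);  /* data relayed at home  */
    int direct  = hops(snf, rnf);                   /* DMT: straight to core */
    printf("data path via HN-F: %d hops, with DMT: %d hops\n", via_hnf, direct);
    return 0;
}
```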

Why DMT / DCT matter

DDR access already costs 70-90 ns. Adding another 20+ cycles of mesh hops at the HN-F hurts P99 latency on Redis, Memcached, Cassandra. These optimizations are why CMN-700 beats earlier meshes by 2-3× on tail latency.

Cache stashing

CHI also supports "cache stashing" — I/O devices (NIC, NVMe) can hint data into a CPU's L2 ahead of the CPU reading it. Widely used for NIC → CPU packet processing.

07

CMN-600 — the First Neoverse Mesh (2018)

  • Launched in 2018. Up to 8×8 mesh = 64 nodes, 128 coherent RN-Fs.
  • CHI-A transport. Up to 1 TB/s aggregate bandwidth on a single mesh.
  • Supports cache stashing, MPAM v0, RAS framework.
  • Shipping in: AWS Graviton 2, Ampere Altra, Huawei Kunpeng 920.
  • Limitation: no CXL, and no coherent accelerator attach until CCIX arrived with CMN-650 — NICs and accelerators hung off plain, non-coherent PCIe.

CMN-650 (2020)

  • Mid-step refresh. CHI-B transport.
  • Added CCIX 1.1 support (coherent accelerator attach over PCIe).
  • First hooks for CXL 1.1.

Graviton 2 topology

64 N1 cores across a 4×8 subset of the mesh (32 tiles used). 2 × 16 MB SLC halves, 32 MB total. 8 × DDR4-3200 channels, 4 × PCIe Gen 4 root complexes. SoC power estimated at roughly 100-110 W (AWS publishes no official TDP).

Ampere Altra sidenote

Same CMN-600, but with 80 N1 cores at up to 3.3 GHz; Altra Max later scaled to 128 cores. Ampere was the first merchant-market N1 server part — Amazon's Graviton 2 was technically first but AWS-exclusive.

08

CMN-700 — the v9 Generation (2021)

  • Up to 12×12 mesh = 144 nodes, up to 256 RN-Fs.
  • CHI-E transport. Higher radix per switch, wider data paths (up to 512-bit).
  • CXL 2.0 support via CXS-to-CXL bridge; allows memory pooling.
  • MPAM v1 mandatory — QoS on bandwidth + cache partitioning.
  • DVM filtering — targeted TLB invalidates rather than broadcast.
  • Introduced slice hashing by locality as an option — reduces mesh traversal on sequential streams.
  • Shipping in: Microsoft Cobalt 100, Alibaba Yitian 710, Graviton 4, NVIDIA Grace.

CXL 2.0 as architected

CMN-700 presents HN-F home nodes for CXL.mem ranges — remote memory pool appears as coherent DRAM. Hyperscale use: disaggregated DRAM tiers, giving tens of TB per rack addressable from every socket.
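A sketch of how an HN-F might steer misses by address range; the window base and size are invented for illustration. Coherence handling is identical for both targets, only the miss latency differs.

```c
/* System address map where a CXL.mem pool is just another backing
 * target behind the HN-Fs. */
#include <stdint.h>
#include <stdio.h>

#define CXL_BASE  0x40000000000ull   /* hypothetical window base */
#define CXL_SIZE  0x10000000000ull   /* hypothetical 1 TB pool   */

typedef enum { TGT_DDR_SNF, TGT_CXL_SNF } target;

/* The HN-F picks the subordinate node by address range. */
static target backing_target(uint64_t pa)
{
    return (pa >= CXL_BASE && pa < CXL_BASE + CXL_SIZE)
           ? TGT_CXL_SNF : TGT_DDR_SNF;
}

int main(void)
{
    printf("PA 0x%llx -> %s\n", 0x1000ull,
           backing_target(0x1000ull) == TGT_CXL_SNF ? "CXL pool" : "local DDR");
    printf("PA 0x%llx -> %s\n", CXL_BASE,
           backing_target(CXL_BASE) == TGT_CXL_SNF ? "CXL pool" : "local DDR");
    return 0;
}
```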

NVLink-C2C piggybacks

NVIDIA Grace uses CMN-700 with a proprietary C2C bridge from CHI to NVLink — 900 GB/s coherent to Hopper/Blackwell GPU. Same semantics as CXL but much higher bandwidth.

09

CMN S3 — the 2024 Chiplet-Aware Mesh

  • Launched at Arm Neoverse Tech Day 2024. Paired with N3 / V3 + CSS.
  • Key shifts vs CMN-700:
    • CHI-E2 transport — wider, higher-frequency data paths
    • Unified on-die + chiplet topology — the mesh spans multiple chiplets over UCIe, presenting a single coherence domain
    • CXL 3.0 — type 1/2/3 devices, multi-host, switchable fabric
    • CCA / Realms — Granule Protection Checks happen at HN-F
    • Higher HN-F radix; better SLC access latency at large mesh sizes
  • Target silicon: AWS Graviton 5 (reported), Microsoft Cobalt 200, multi-chiplet HPC/AI servers.

Chiplet coherence

CMN S3 treats UCIe links like internal mesh links. A read from a core on chiplet A to a cache line owned by an RN-F on chiplet B looks identical to on-die traffic — same REQ/SNP/RSP/DAT flow, just traverses UCIe.

GPC at HN-F

Every read/write to physical memory goes through a Granule Protection Check against the GPT (Granule Protection Table). HN-F is the natural enforcement point — it's the last stop before DRAM.
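A simplified model of the check: each 4 KB granule is tagged with an owning world, and the HN-F rejects mismatched accesses. The flat array and enum here are illustrative only; the real GPT is a multi-level structure defined by Arm's RME architecture.

```c
/* Conceptual granule-protection check at the HN-F. */
#include <stdint.h>
#include <stdio.h>

typedef enum { GPI_ROOT, GPI_REALM, GPI_SECURE, GPI_NONSECURE, GPI_ANY } gpi;

/* Hypothetical flat GPT: one entry per 4 KB granule. */
static gpi gpt_lookup(const gpi *gpt, uint64_t pa) { return gpt[pa >> 12]; }

static int gpc_allows(const gpi *gpt, uint64_t pa, gpi world)
{
    gpi g = gpt_lookup(gpt, pa);
    return g == GPI_ANY || g == world;   /* else: granule protection fault */
}

int main(void)
{
    gpi gpt[4] = { GPI_NONSECURE, GPI_REALM, GPI_REALM, GPI_ANY };
    printf("NS access to realm granule: %s\n",
           gpc_allows(gpt, 0x1000, GPI_NONSECURE) ? "ok" : "blocked");
    printf("Realm access to realm granule: %s\n",
           gpc_allows(gpt, 0x1000, GPI_REALM) ? "ok" : "blocked");
    return 0;
}
```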

10

CMN Generations Side-by-Side

Gen · Year · Max mesh · CHI version · PCIe / CXL / CCIX · Shipping in
CMN-600 · 2018 · 8×8 (64 tiles) · CHI-A · PCIe 4.0, no CXL · Graviton 2, Altra, Kunpeng 920
CMN-650 · 2020 · 8×8 · CHI-B · CCIX 1.1, CXL 1.1 (hooks) · Older Altra derivatives
CMN-700 · 2021 · 12×12 (144) · CHI-E · CXL 2.0, PCIe 5.0 · Graviton 3/4, Cobalt 100, Grace
CMN S3 · 2024 · Multi-chiplet · CHI-E2 · CXL 3.0, PCIe 6.0, UCIe · Cobalt 200 (reported), Graviton 5

The trend: more tiles per mesh, higher-frequency links, and cleaner integration with CXL (for memory pooling) and UCIe (for chiplets). Every CHI generation is a strict superset of the prior one.

11

Memory Controllers & SN-F Nodes

  • Each memory channel has an SN-F node on the mesh.
  • SN-F nodes map to Arm memory-controller IP:
    • DMC-520 — LPDDR4 / DDR4, 32-bit
    • DMC-620 — DDR4, 64-bit, CMN-600 era
    • DMC-750 — DDR5 + RDIMM/LRDIMM, CMN-700 era
    • DMC-1000 (2024) — DDR5-6400, CXL 2.0, ECC-chipkill
  • Additional SN-F variants front HBM, CXL.mem (pooled memory), and Flash/NVDIMM. Grace is the outlier, with up to 480 GB of LPDDR5X behind its SN-Fs.
  • Memory scrubbing happens at the SN-F — patrol + on-demand scrubbers catch the soft errors that accumulate in large-DRAM server fleets.

Hash → channel placement

For a flat memory view, the address → HN-F → SN-F mapping must be channel-aware: consecutive cache lines should land on different SN-Fs so a sequential stream uses every channel's bandwidth. Software never sees this — it's pure hardware.
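A quick way to convince yourself: run sequential lines through a line-interleaved hash (the same XOR-fold assumption as the earlier sketch, with an illustrative channel count) and histogram the channels.

```c
/* Check that line-interleaved hashing spreads a sequential stream
 * uniformly across channels. */
#include <stdint.h>
#include <stdio.h>

#define NUM_CH 8u

static unsigned channel_of(uint64_t pa)
{
    uint64_t h = pa >> 6;          /* drop the 64 B line offset   */
    h ^= h >> 12; h ^= h >> 24;    /* XOR-fold high bits into low */
    return (unsigned)(h & (NUM_CH - 1));
}

int main(void)
{
    unsigned hist[NUM_CH] = {0};
    for (uint64_t pa = 0; pa < 1024 * 64; pa += 64)   /* 1024 sequential lines */
        hist[channel_of(pa)]++;
    for (unsigned c = 0; c < NUM_CH; c++)
        printf("channel %u: %u lines\n", c, hist[c]); /* 128 each: uniform */
    return 0;
}
```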

8-12 channels standard

Neoverse servers use 8-12 DDR5 channels per socket. Aggregate bandwidth: 300-600 GB/s per socket, enough to keep 128 cores memory-fed for most cloud workloads.
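As a sanity check on those numbers: 12 channels of DDR5-5600 peak at 12 × 5.6 GT/s × 8 B per transfer ≈ 538 GB/s, squarely in that range.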

12

I/O Sub-system — RN-I, SMMU, PCIe

  • An RN-I (I/O Request Node) sits on the mesh for every I/O domain. It bridges non-coherent AXI/CXL/PCIe traffic into coherent CHI.
  • Each RN-I is preceded by an SMMU-v3 that:
    • Translates I/O virtual addresses → PA (stage 1)
    • Enforces VM isolation (stage 2)
    • Resolves each device's STE / CD (SMMU Stream Table Entry, Context Descriptor) to find its translation context
    • Supports PCIe ATS + PRI (Address Translation Service, Page Request Interface)
  • PCIe Gen 5 / 6 controllers typically integrated on the mesh edge with dedicated RN-I nodes.
  • CXL devices can present as RN-F (type-1 accelerator, with its own cache) or SN-F (type-3 memory expander).
[Figure: I/O path on CMN — PCIe 5/6 controller → SMMU-700 (stage 1 + 2, ATS, PRI) → RN-I on the CMN mesh → coherent CHI to HN-F / SLC]
13

Performance Features of CMN

  • SLC Cache Stashing — NIC/NVMe can stash packet data into the LLC before the CPU reads it.
  • QoS classes — 16 QoS levels per REQ; MPAM-mapped.
  • Dynamic snoop filter expansion — HN-Fs can switch between inclusive / partial-inclusive snoop filtering based on workload pressure.
  • Ordering domains — REQ ordering can be relaxed per address space for bandwidth (used for accelerator + GPU traffic).
  • Flow-control credits per VC — prevents head-of-line blocking between REQ/SNP/DAT.
  • Error containment — RAS framework routes poison to the node best placed to handle it (usually HN-F or memory controller).

Arm PMU on the mesh

Each CMN node exposes its own PMU. Total ~500 events across a large mesh. Linux tools (arm_cmn driver + perf) expose them as arm_cmn_0/hnf_cache_miss/ etc. Essential for diagnosing LLC / memory-bandwidth issues at socket level.
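The driver follows the standard perf PMU sysfs layout, so the available event aliases can be listed straight from sysfs. A small sketch (the path exists only on CMN hardware with the arm_cmn driver loaded):

```c
/* Enumerate the event aliases the arm_cmn driver exports via sysfs. */
#include <dirent.h>
#include <stdio.h>

int main(void)
{
    const char *dir = "/sys/bus/event_source/devices/arm_cmn_0/events";
    DIR *d = opendir(dir);
    if (!d) { perror(dir); return 1; }
    struct dirent *e;
    while ((e = readdir(d)) != NULL)
        if (e->d_name[0] != '.')
            printf("arm_cmn_0/%s/\n", e->d_name);  /* usable as a perf event */
    closedir(d);
    return 0;
}
```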

QoS in practice

Cloud tenants tagged "gold" get QoS 0-3 (highest priority); best-effort jobs get QoS 12-15. At high mesh pressure, low-QoS traffic queues; high-QoS cuts through.
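A sketch of what such a policy table might look like in fabric-management firmware. The class names and band values mirror the text above but are illustrative, not an Arm API.

```c
/* Tenant class -> CHI QoS value mapping; this sketch follows the
 * slide's convention that the 0-3 band is highest priority. */
#include <stdio.h>

typedef enum { TENANT_GOLD, TENANT_SILVER, TENANT_BEST_EFFORT } tenant_class;

/* CHI carries a 4-bit QoS field per request. */
static unsigned qos_for(tenant_class c)
{
    switch (c) {
    case TENANT_GOLD:   return 0;    /* QoS 0-3 band: cuts through  */
    case TENANT_SILVER: return 6;    /* middle band                 */
    default:            return 12;   /* QoS 12-15: queues under load */
    }
}

int main(void)
{
    printf("gold -> QoS %u, best-effort -> QoS %u\n",
           qos_for(TENANT_GOLD), qos_for(TENANT_BEST_EFFORT));
    return 0;
}
```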

14

CMN vs Intel Mesh vs AMD Infinity Fabric

Property · Arm CMN-700 / S3 · Intel Mesh (Sapphire / Granite) · AMD Infinity Fabric 4
Topology · 2-D mesh of tiles · 2-D mesh of tiles · Hub-and-spoke + IFOP links across chiplets
Coherence · AMBA 5 CHI, directory at HN-F · MESIF, distributed LLC slices · MOESI with probe filter at the IOD
LLC · Distributed SLC (128-512 MB) · Distributed LLC (60-120 MB) · Per-CCD L3, optionally 3D V-Cache (96-192 MB)
Snoop filter · Tag directory at HN-F · Full directory at mesh nodes · Probe filter on the I/O die
Cross-chiplet · UCIe (CMN S3) / custom C2C · EMIB + chip-to-chip interposers · Infinity Fabric serial links
External coherence · CCIX, CXL 2/3 · CXL 1.1 / 2.0 · CXL 1.1 / 2.0

All three converged on mesh + directory coherence for 64+ cores. Biggest differentiator for Arm: standardised CHI protocol, which means IP from Arm, Synopsys, and customers interoperate at RTL level.

15

Lessons

  • "Why is HN-F the key node?" → owns the snoop-filter directory + a slice of SLC. Every coherent read/write to its address range goes there. Scales to >128 cores because snoops are targeted, not broadcast.
  • "Explain DMT vs DCT" → DMT = HN-F tells SN-F to forward DRAM data directly to requester (skip HN-F buffer). DCT = HN-F tells another RN-F to forward cached data directly. Both cut L3-miss latency by ~20 cycles.
  • "How does CXL.mem integrate?" → external memory expanders appear as SN-F nodes. HN-Fs own the CXL address ranges and enforce coherence. Software sees a normal RAM region with slightly higher latency.
  • "What's MPAM?" → partition ID (PartID) tagged on every transaction; HN-F enforces cache-way + bandwidth limits per PartID. Used for cloud multi-tenancy QoS.
  • "Why CHI and not AXI?" → AXI-ACE topped out at ~16 coherent nodes because snoops were broadcast. CHI is a packet-switched protocol that scales with the mesh.
16

References

Arm Ltd. · CMN-600 / 650 / 700 Technical Reference Manuals: node types, mesh configuration, CHI programmer's view
Arm Ltd. · AMBA 5 CHI Architecture Specification (Issue E / E2): the complete CHI protocol
Arm Ltd. · MPAM System Architecture Specification (DDI 0598)
Arm Ltd. · Neoverse CSS N2 / V2 / V3 reference designs (2023-24)
AWS · Graviton 2/3/4 whitepapers: CMN topology and SLC sizing
Hot Chips · annual Neoverse interconnect presentations
Chips and Cheese · measurement-based CMN latency and bandwidth deep dives
Linux kernel · drivers/perf/arm-cmn.c: CMN PMU driver with the full node / event catalogue

Presentation built with Reveal.js 4.6 · Playfair Display + DM Sans + JetBrains Mono
Educational use.