ARM AMBA · PRESENTATION 05

CHI — Coherent Hub Interface

Packet-Based Coherency for 128+ Nodes · The Protocol Behind Neoverse
REQ · RSP · SNP · DAT · RN-F · HN-F · SN-F · CMN-600/650/700 · CHI-A/B/C/D/E/F

Why CHI?

  • ACE was a signal-level protocol — every master port physically owned AC/CR/CD wires running to the interconnect.
  • At 2 clusters: fine. At 8 clusters: the snoop fan-out starts to hurt. At 32+ coherent agents: the wire count becomes unmanageable.
  • For servers and high-end SoCs Arm needed a transport-agnostic, packet-based protocol: any coherent message becomes a packet that the interconnect routes over whatever topology fits (ring, mesh, hybrid).
  • First delivered as AMBA 5 CHI Issue A in 2013 alongside Arm's CCN-504 interconnect; refined through CHI-B/C/D/E/F as Neoverse and CMN-600/650/700 scaled up.

CHI in one sentence

CHI is a layered, packet-based coherent protocol with four message channels, typed node roles, and a home-based directory — designed to scale from 2 to 128+ coherent agents without changing the protocol.

The specification (IHI 0050) is deliberately layered: a transport-agnostic protocol layer sits above the link-layer flit rules — you can implement CHI over any NoC that preserves per-channel ordering.

CHI Node Types

Node  | Role                                                                 | Example
RN-F  | Request Node, Fully coherent — has caches, issues coherent requests  | Cortex-A / Neoverse CPU cluster
RN-I  | Request Node, I/O-coherent — no caches, never snooped                | Non-coherent DMA, NIC, AXI bridge
RN-D  | Request Node with DVM — I/O-coherent plus DVM participation          | SMMU (MMU-600 / MMU-700)
HN-F  | Home Node, Fully coherent — directory + System Level Cache slice     | CMN-700 HN-F tile
HN-I  | Home Node, I/O — gateway to an AXI I/O region                        | CMN-700 HN-I tile
SN-F  | Slave Node, Fully coherent — memory endpoint                         | DDR / HBM controller, CXL.mem
MN    | Misc Node — DVM serialisation, broadcasts                            | Central management block

The Home Node is the star

In ACE the coherency logic sat inside the CCI. In CHI it lives in the distributed HN-F tiles. Each HN-F owns a slice of the physical address space; all coherent requests for addresses in its slice flow through it.

This address-hash-to-home mapping is what turns CHI from "snoop everyone" into "ask the home, which knows who to ask."
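The hash-to-home mapping can be sketched as a small software model. The XOR-fold below is purely illustrative (the production CMN hash function is Arm-proprietary); what matters is that every agent computes the same home for the same line, and power-of-two strides don't all land on one HN-F:

```python
def home_node(addr: int, n_hnf: int) -> int:
    """Map a cache-line address to its owning HN-F slice.

    XOR-folds the line-address bits so consecutive lines (and
    power-of-two strides) spread across all homes instead of
    striping onto one. Illustrative only; the real CMN hash
    is different. n_hnf must be a power of two here.
    """
    line = addr >> 6                    # 64 B cache-line granule
    h = 0
    while line:
        h ^= line & (n_hnf - 1)         # fold successive bit groups
        line >>= (n_hnf - 1).bit_length()
    return h

# Consecutive lines spread across all 64 homes:
assert len({home_node(i * 64, 64) for i in range(64)}) == 64
# A 4 KB-strided stream still avoids striping onto home 0:
assert home_node(0x1000, 64) != home_node(0x0, 64)
```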

The Four Message Channels

  • Every CHI interface (a port between two nodes, with independent transmit and receive links) carries four virtual channels:
    • REQ — request from RN / HN (ReadShared, ReadUnique, WriteBack, CMO, Atomic…)
    • RSP — responses (CompAck, RetryAck, DBIDResp, DVMResp…)
    • SNP — snoops directed from HN to RN-F
    • DAT — cache-line data (bursts)
  • Each channel has its own credit-based flow control — a producer must hold a credit for the target's buffer before sending.
  • Packets are carried as flits — small fixed-size chunks that each fit in one NoC beat. A full 64 B cache line is typically 4 × 16 B data flits on a 128-bit link.
[Diagram: a CHI link between RN-F and HN-F carrying the four virtual channels — REQ, RSP, SNP, and DAT in each direction (HN→RN and RN→HN).]
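The per-channel credit discipline can be modelled in a few lines of software (a behavioural sketch, not RTL; the class name is illustrative):

```python
class ChannelLink:
    """One CHI virtual channel with credit-based flow control.

    The receiver grants one credit per free buffer slot; the
    sender may only transmit while it holds a credit. Nothing
    is ever dropped: a sender without credits simply stalls.
    """
    def __init__(self, buffer_slots: int):
        self.credits = buffer_slots      # all slots start free
        self.buffer = []

    def try_send(self, flit) -> bool:
        if self.credits == 0:
            return False                 # stall: no credit held
        self.credits -= 1
        self.buffer.append(flit)
        return True

    def consume(self):
        """Receiver drains a flit and returns its credit."""
        flit = self.buffer.pop(0)
        self.credits += 1
        return flit

req = ChannelLink(buffer_slots=2)
assert req.try_send("ReadShared") and req.try_send("ReadUnique")
assert not req.try_send("WriteBack")     # stalled, not dropped
req.consume()                            # credit returned to sender
assert req.try_send("WriteBack")
```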

CHI Cache States

  • CHI inherits the five-state MOESI-variant from ACE, with slightly different names and tighter specification:
    • UC — UniqueClean (E)
    • UD — UniqueDirty (M)
    • SC — SharedClean (S)
    • SD — SharedDirty (O)
    • I — Invalid
  • Additionally CHI-E introduced the partial-line concept (UCE — Unique Clean Empty, used when a line is reserved for write-allocate but not yet populated).
  • The HN-F's directory entry stores coarse state per line: who holds it, and what overall shared/unique state the line is in — enough to determine which snoops to fan out.

Directory entry (per line, per HN-F)

tag           : address hash slice
shared_vec    : N-bit mask of RNs holding it
dirty         : {clean | dirty-elsewhere | none}
SLC state     : which way of the System
                Level Cache holds the data
pending_txn   : outstanding request FSM ptr
A cache-coherent CHI system is effectively a set of directories at the HN-Fs, each tracking what its address slice is cached as across the RN-Fs.
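The directory's core job — turning a request into the minimal snoop set — can be sketched as follows (field names mirror the illustrative entry above and are not the spec's):

```python
def snoop_targets(shared_vec: int, requester: int, n_rn: int) -> list:
    """Return the RN-F IDs the HN-F must snoop for a coherent
    request, given the directory's holder bitmask. Only actual
    holders (minus the requester itself) are snooped; there is
    never a broadcast to all n_rn agents."""
    return [rn for rn in range(n_rn)
            if (shared_vec >> rn) & 1 and rn != requester]

# Line held only by RN-F 5 (bit 5 set); RN-F 0 requests it:
assert snoop_targets(0b100000, requester=0, n_rn=128) == [5]
# Nobody holds the line: zero snoops, go straight to memory:
assert snoop_targets(0, requester=0, n_rn=128) == []
```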

Mesh Topology — CMN-700

[Diagram: CMN-700-style mesh NoC — each tile holds an RN-F (CPU), HN-F (home slice), and/or SN-F (memory port); edge tiles host RN-I (NIC), RN-D (SMMU), HN-I (PCIe), and SN-F (DDR / HBM) ports. XY routing between tiles; each tile has a Cross-Point (XP) with credit-managed CHI channels.]

Request Flow — Read Miss

Scenario: CPU (RN-F A) misses; the address hashes to HN-F@(2,2); the line is cached at RN-F B in UD.

  1. REQ — RN-F A → HN-F: ReadShared (TxnID=7)
  2. SNP — HN-F → RN-F B: SnpShared
  3. DAT — RN-F B → HN-F: SnpRespData_I_PD (dirty passed)
  4. DAT — HN-F → RN-F A: CompData_SD (CPU A → SharedDirty)
  5. RSP — RN-F A → HN-F: CompAck (optional WriteBack to SN-F deferred)

The HN-F knows from its directory that only RN-F B holds the line, so it sends a single targeted snoop — not a broadcast. That's CHI's scaling trick.


Snoop Response Types

SnpResp           | Meaning
SnpResp_I         | Line not present / Invalid at responder
SnpResp_SC        | Responder keeps a SharedClean copy
SnpResp_UC        | Responder had UC → now I (line moved)
SnpResp_SD        | Responder keeps SharedDirty
SnpRespData_*     | Same as above + includes cache-line data (DAT channel)
SnpRespData_I_PD  | Line dropped & dirty forwarded ("passed dirty")
RetryAck          | Responder can't accept the snoop now; try later

"Passed Dirty" (PD)

When a snoop responder has UD (dirty) and is being asked to drop the line (e.g. SnpUnique), it sends SnpRespData_I_PD — "here is the data, it's dirty, it's now your responsibility to write back."

The HN-F may hold the dirty data in its System Level Cache (SLC) without immediately writing it back — saves DRAM traffic for popular lines.
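The HN-F's side of a passed-dirty response can be sketched as below (class and field names are illustrative; deferring the write-back until SLC eviction is one common policy, as described above):

```python
class SlcLine:
    """One line slot in an HN-F's System Level Cache slice."""
    def __init__(self):
        self.data, self.dirty = None, False

def accept_snp_resp_data_i_pd(slc_line: SlcLine, data: bytes):
    """HN-F receives SnpRespData_I_PD: the snooped RN-F dropped
    the line and passed dirty ownership. The HN-F absorbs the
    data into its SLC slice and marks it dirty; the write-back
    to the SN-F is deferred until the SLC evicts the line."""
    slc_line.data = data
    slc_line.dirty = True            # HN-F now owns the write-back

line = SlcLine()
accept_snp_resp_data_i_pd(line, b"\x00" * 64)
assert line.dirty and line.data is not None   # no DRAM write yet
```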

DMT & DCT — Data Movement Optimisations

DMT — Direct Memory Transfer

  • For plain misses where no cache holds the line, HN-F asks SN-F (DRAM) to send data directly to the requester (RN-F), bypassing the HN-F's critical path.
  • HN-F stays in the loop for tracking (response still comes back), but the data flit does not have to hop through it.

DCT — Direct Cache Transfer

  • When a peer RN-F has the data (Shared or Unique), HN-F can ask that peer to forward it directly to the requester, again bypassing HN-F as a relay.
  • Critical for cache-to-cache latency on contended lines.

Why this matters

On a mesh, going through HN-F means 2× the hop count. DMT/DCT cuts cache-line latency by ~30–40% for the common cases (miss-to-memory, and contended read from peer cache).

DCT makes CHI a true distributed-directory protocol — the HN-F arbitrates, but doesn't always shepherd data through itself.
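The hop saving is easy to quantify on an XY-routed mesh with Manhattan distances. This is a back-of-envelope model with a deliberately unfavourable home placement, not CMN timing:

```python
def hops(a, b):
    """XY-routing hop count between two mesh tiles (x, y)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

# Requester, home, and memory port at three corners of an 8x8 mesh:
rn, hn, sn = (0, 0), (7, 0), (0, 7)

# The command path is the same either way: RN -> HN -> SN.
# Data path without DMT: SN -> HN -> RN (relayed through the home).
relayed_data = hops(sn, hn) + hops(hn, rn)
# Data path with DMT: SN -> RN directly.
direct_data = hops(sn, rn)

assert relayed_data == 21 and direct_data == 7   # 3x fewer data hops here
```

The saving depends on placement: if the home happens to sit on the shortest RN–SN path, relaying costs nothing extra, which is why the quoted ~30–40% is an average over common cases, not a worst case.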

The System Level Cache (SLC)

  • Each HN-F tile in CMN-600/650/700 includes a slice of the System Level Cache — a distributed L3/L4 shared across all RN-Fs.
  • Functions:
    • Reduce DRAM traffic (hit in SLC = no SN-F access).
    • Act as a snoop-filter-equivalent — if a line is in SLC, track who also holds it.
    • Hold dirty data temporarily until an eviction forces write-back.
  • Total SLC in a Neoverse V2 system can be ≥ 128 MB — 2 MB per HN-F × 64 HN-Fs on an 8×8 mesh.

Address hashing

A cache-line address is hashed to decide which HN-F owns its slice — typically a hash of address bits to avoid stripe patterns. This balances both cache capacity and traffic across the mesh.

Neoverse-V2 systems can configure the SLC as either traditional cache or as a scratchpad/pinned-region for latency-critical data (via MPAM region attributes). The interconnect is no longer just a transport — it's a managed shared resource.

CHI Issue Timeline

Issue  | Year | Key adds
CHI-A  | 2013 | Original — Cortex-A57, CCN-504
CHI-B  | 2014 | Atomics, extended set of CMO ops
CHI-C  | 2016 | Extended QoS, exclusive ops
CHI-D  | 2018 | MPAM, stashing, persistent-memory hooks
CHI-E  | 2021 | RME / CCA Realms, partial-line state (UCE)
CHI-F  | 2023 | CHI-C2C chiplet extensions, UCIe interop hooks

Each issue is backwards-compatible

Older CHI-A masters can talk to a CHI-E home (the new features simply go unused). Interface capabilities are fixed as configuration properties at integration time, so old IP keeps working.

This matters commercially: a CPU licensee can tape out a CHI-D RN-F and pair it with a CHI-E CMN-700 without rewiring either.

CMN-600 / 650 / 700 — Arm's Interconnect IP

IP               | Year    | Max mesh              | Peak RN-F
CCN-504/508/512  | 2013–16 | ring                  | 16
CMN-600          | 2016    | 8×8                   | 64
CMN-650          | 2020    | 8×8                   | 128, with MPAM
CMN-700          | 2022    | 12×12                 | 128+, with RME
CMN S3           | 2024    | larger mesh + CHI-C2C | chiplet-ready

Every Neoverse server SoC (N1, N2, V1, V2) uses one of these. AWS Graviton, Ampere Altra, NVIDIA Grace, Microsoft Cobalt — all CMN-based.

Scale headline numbers

  • CMN-700 on an 8×8 mesh: 64 tiles, each typically hosting 2 RN-F + 2 HN-F ports — hundreds of coherent participants.
  • SLC capacity: up to 512 MB distributed.
  • Peak bandwidth: >1 TB/s to memory across 8 DDR channels + HBM3 stacks.
NVIDIA Grace uses CMN-700 with 72 Neoverse V2 cores; AWS Graviton 4 uses it with 96 cores. The same IP at different tile counts.

Retry & Credits

  • Every CHI channel has a finite buffer at the receiver.
  • Senders operate with credits: you don't send a packet unless you hold a credit for its channel at the target. Credits are returned after the packet is consumed.
  • If the sender has no credit available, the packet waits. No packet is dropped.
  • For request channels, a target that cannot accept right now responds with RetryAck; later it issues a PCrdGrant. The requester then re-sends the request carrying the granted credit type (PCrdType). Bounded retry — no livelock.

Why credits?

Credit-based flow control is the standard solution for lossless, bounded-buffer interconnects (InfiniBand, PCIe, modern NoCs). It avoids the deadlock possibilities of store-and-forward while keeping buffers small.

CHI-E added stronger ordering-domain definitions so credits never cross ordering-domain boundaries — important for CCA (Realms).
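A minimal behavioural model of the retry flow on the request channel (opcode names from the spec; the FIFO queueing of waiters and the class shape are illustrative):

```python
class HomeNode:
    """Models a target with a bounded request tracker. When full
    it answers RetryAck; when a slot frees it hands out a
    PCrdGrant, and the requester re-sends carrying that
    protocol credit. No request is ever dropped."""
    def __init__(self, slots: int):
        self.free, self.waiters = slots, []

    def request(self, rn, with_pcredit=False) -> str:
        if with_pcredit or self.free > 0:
            if not with_pcredit:        # a granted credit pre-reserved a slot
                self.free -= 1
            return "Comp"               # accepted and tracked
        self.waiters.append(rn)
        return "RetryAck"               # tracker full: bounded retry

    def release(self):
        """A tracker slot frees up: grant a protocol credit if
        someone is waiting, otherwise reclaim the slot."""
        if self.waiters:
            return ("PCrdGrant", self.waiters.pop(0))
        self.free += 1
        return None

hn = HomeNode(slots=1)
assert hn.request("RN0") == "Comp"
assert hn.request("RN1") == "RetryAck"         # tracker full
assert hn.release() == ("PCrdGrant", "RN1")    # slot freed, credit granted
assert hn.request("RN1", with_pcredit=True) == "Comp"
```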

DVM on CHI

  • DVM on CHI is a dedicated request opcode (DVMOp) rather than a special AC-channel encoding.
  • Flows:
    • An RN-F or RN-D (typically SMMU) sends a DVMOp REQ to a Misc Node (MN) — the DVM serialisation point.
    • MN forwards SnpDVMOp to every RN-F + RN-D in the DVM domain.
    • Each peer acknowledges; MN aggregates; requester gets final DVMComplete.
  • Payload: TLBI address range, ASID, VMID, IC invalidate, sync barrier.
  • CHI-E extended DVM to handle RME (Realm) invalidates separately.

DVM Scaling

At 128 RN-Fs, a single broadcast TLBI (Inner Shareable) that must fan out to every peer is expensive. Neoverse-class systems pipeline DVM aggressively — multiple in-flight DVMOps can overlap, and a DVM Sync only blocks on the ones that matter.

The Misc Node (MN) is the single serialisation point. Systems sometimes need multiple MNs for DVM scaling — the spec allows it.
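The MN's serialise, fan-out, and aggregate role can be sketched as one function (a behavioural model under simplifying assumptions: one DVMOp at a time, and "SnpResp" standing in for the actual RSP-channel acknowledgement):

```python
def dvm_broadcast(dvm_domain_peers: list, op) -> str:
    """Misc Node behaviour for one DVMOp: forward SnpDVMOp to
    every RN-F / RN-D in the DVM domain, collect every peer's
    acknowledgement, and only then return the final completion
    to the original requester."""
    acks = [peer(op) for peer in dvm_domain_peers]   # fan-out + gather
    assert all(a == "SnpResp" for a in acks)         # every peer answered
    return "DVMComplete"

# Eight peers that each invalidate their TLB entry and acknowledge:
peers = [lambda op: "SnpResp" for _ in range(8)]
assert dvm_broadcast(peers, ("TLBI", "ASID=7")) == "DVMComplete"
```

Real systems overlap many DVMOps in flight, which is exactly the pipelining the paragraph above describes.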

MPAM — Partitioning on CHI

  • MPAM — Memory System Resource Partitioning and Monitoring — is Arm's answer to CAT (Cache Allocation Technology) on x86.
  • CHI-D and CHI-E carry a PARTID on every transaction (plus PMG — Performance Monitoring Group).
  • HN-F tiles enforce per-PARTID policies:
    • SLC way allocation quotas
    • Memory-bandwidth fair shares
    • Miss-rate telemetry
  • Lets a hypervisor prevent a noisy-neighbour VM from thrashing the system cache or saturating memory bandwidth.
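The SLC way-allocation quota can be modelled as a bitmask check at fill time (a sketch with hypothetical names; real HN-F policy tables also cover bandwidth shares and monitoring):

```python
def allowed_ways(partid: int, policy: dict, n_ways: int) -> list:
    """Which SLC ways a fill with this PARTID may allocate into.
    The per-PARTID policy is a bitmask over the cache ways, so a
    noisy tenant is fenced into its quota of ways while reads
    can still hit anywhere."""
    mask = policy.get(partid, (1 << n_ways) - 1)   # default: all ways
    return [w for w in range(n_ways) if (mask >> w) & 1]

policy = {7: 0b0000_0011}                 # PARTID 7: only ways 0-1
assert allowed_ways(7, policy, 8) == [0, 1]
assert allowed_ways(3, policy, 8) == list(range(8))   # unrestricted
```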

Why this lives in the interconnect

CPU-side partitioning (pinning cores, LLC ways) isn't enough — the interconnect itself has contended resources (SLC, SN-F queues). Putting PARTID on the transaction lets every shared resource enforce policy.

Real deployment: cloud platforms can apply MPAM per container or VM so chatty tenants don't starve latency-critical ones; Linux exposes the controls through a resctrl-like interface.

CHI for Realms — CCA & RME

  • Arm Confidential Compute Architecture (CCA, Armv9.2) introduces Realms — hypervisor-unreadable virtual machines.
  • CHI-E adds attributes to support this:
    • 4-way security state per transaction — Root / Realm / Secure / Non-Secure.
    • Granular permission checks at the HN-F (Realm transactions can't read Secure memory, etc.).
    • SLC tagging ensures Realm lines aren't returned to Non-Secure readers.
    • DVM semantics extended to invalidate Realm TLB entries separately.

RMM on the critical path

The Monitor (EL3 firmware) programs the GPTs (Granule Protection Tables) that record which physical granules belong to which world; the Realm Management Monitor (RMM), running at Realm EL2, manages the Realms themselves. The interconnect (HN-F / SMMU / memory controller) enforces those tables at every transaction.

GPT checks: every cache-line access at the HN-F is gated by a per-granule permission table. This is how CCA protects Realms from a compromised hypervisor.
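The per-granule gate can be sketched as a table lookup on every access (world names from RME; the dict layout and the default for unmapped granules are illustrative, not the architected GPT format):

```python
def gpt_check(gpt: dict, addr: int, world: str) -> bool:
    """Gate an access by the Granule Protection Table: the 4 KB
    granule holding addr must be assigned to the accessor's
    world. Root may access everything; any other mismatch is a
    granule protection fault. Unmapped granules default to
    Non-Secure in this toy model."""
    granule = addr >> 12                       # 4 KB granules
    owner = gpt.get(granule, "Non-Secure")
    return world == "Root" or owner == world

gpt = {0x80000: "Realm"}                       # granule at PA 0x8000_0000
assert gpt_check(gpt, 0x8000_0040, "Realm")            # Realm reads its own memory
assert not gpt_check(gpt, 0x8000_0040, "Non-Secure")   # compromised hypervisor blocked
assert gpt_check(gpt, 0x8000_0040, "Root")
```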

CHI vs ACE — Side by Side

Aspect        | ACE                         | CHI
Layer         | Signal-level                | Packet-based
Scope         | ≤ ~8 masters                | 128+ coherent nodes
Snoop model   | Broadcast (filter-assisted) | Home-based directory
Topology      | Arbiter / small crossbar    | Any NoC (ring / mesh)
Data flow     | Always via CCI              | DMT / DCT direct paths
Cache levels  | Up to shared L3             | Full System Level Cache
Atomics       | ACE5 only                   | From CHI-B
Partitioning  | None (per-master QoS hint)  | MPAM
Security      | TrustZone only              | RME / CCA Realms

When to pick which

  • Mobile/embedded SoC with 2-cluster CPU: ACE (cheaper wires).
  • Server or datacentre SoC with 16+ CPUs: CHI (scales).
  • Mixed: CHI at system level, ACE bridges for legacy masters or cluster-local coherency.

Verification — Why CHI is Hard

  • Huge state space: 128+ RN-Fs, each with independent caches; each line has 5 states × N possible holders.
  • Packet reordering: NoC may reorder between channels.
  • Ordering rules are subtle: some transactions serialise at HN, others at RN; CompAck ordering vs DAT arrival; PD forwarding.
  • Modern CHI verification is a mix of:
    • Formal (Jasper / Questa) on protocol properties at each interface.
    • UVM constrained-random on the mesh-level fabric.
    • Emulation (Palladium / Veloce) for system-level boot on real software.
    • Post-silicon coherency stress tests (memory ordering litmus).

Arm ABVIP for CHI

Arm publishes formal assertion libraries for every CHI interface. SoC houses plug these in at every port (RN-F boundary, HN-F boundary, SN-F boundary) and run 24-hour formal proofs to cover protocol deadlock and data integrity.

Litmus testing: post-silicon, teams run tools such as litmus7 / herd7 (from the herdtools suite) to generate memory-ordering stress tests and run millions of iterations to catch coherency bugs that escaped verification.

Minimum-Viable CHI — and Why There Isn't One

Why CHI resists "minimal"

  • CHI is packet-based with 4 virtual channels + credit flow control + a directory.
  • The smallest compliant RN-F must: respond to every inbound SNP type, maintain per-line cache state, track outstanding TxnIDs, generate CompAck, and honour ordering domains. ~2–4 k flops before you add any actual caching.
  • There is no "just wires" CHI. The protocol's coherency contract forbids it.

What can be minimal: an RN-I-ish bridge

A non-coherent CHI Request Node (no caches, no SNP channel participation) wrapped around an AXI4 master is the simplest agent that can plug into a CHI mesh — useful for DMA and legacy IP integration.

// Sketch: CHI RN-I built as a thin state-machine
// in front of an existing AXI4 master.
// No caches; never receives SNP on the RN-I port.
module chi_rni_axi_bridge (
  input  logic        clk, rstn,
  // --- CHI RN-I ports (to interconnect) ---
  input  logic        chi_req_credit,   // credit from XP
  output logic        chi_req_flitv,
  output logic [63:0] chi_req_flit,     // opcode + addr + TxnID
  input  logic        chi_rsp_flitv,
  input  logic [31:0] chi_rsp_flit,     // DBIDResp / Comp
  input  logic        chi_dat_flitv,
  input  logic [255:0] chi_dat_flit,    // DAT payload
  // --- AXI4 slave port (to the non-coherent master) ---
  input  logic        ARVALID, output logic ARREADY,
  input  logic [31:0] ARADDR,
  output logic        RVALID,  input  logic RREADY,
  output logic [31:0] RDATA,   output logic RLAST
  // ... AW/W/B tied off for brevity
);
  typedef enum logic [1:0] {IDLE, REQ, WAIT, RSP} st_t;
  st_t st, nxt;
  logic [7:0]  txnid;
  logic [31:0] rdata_q;                // captured DAT payload

  always_ff @(posedge clk or negedge rstn)
    if (!rstn) begin
      st <= IDLE; txnid <= '0;
    end else begin
      st <= nxt;
      // One TxnID per issued flit (only when the flit actually goes out)
      if (st == REQ && chi_req_credit) txnid <= txnid + 1'b1;
      // Capture the data beat: chi_dat_flit is only valid this cycle
      if (chi_dat_flitv)               rdata_q <= chi_dat_flit[31:0];
    end

  always_comb begin
    nxt = st; chi_req_flitv = 1'b0;  ARREADY = 1'b0;
    RVALID = 1'b0; RLAST = 1'b0;
    unique case (st)
      IDLE: if (ARVALID)                 nxt = REQ;
      REQ : if (chi_req_credit) begin    // hold until a REQ credit is free
              chi_req_flitv = 1'b1;
              ARREADY       = 1'b1;      // complete the AR handshake as the flit issues
              nxt = WAIT;
            end
      WAIT: if (chi_dat_flitv)           nxt = RSP;
      RSP : begin RVALID = 1'b1; RLAST = 1'b1;
              if (RREADY)                nxt = IDLE;
            end
    endcase
  end

  assign chi_req_flit = {8'h00 /*ReadNoSnp*/,
                         ARADDR, txnid, 16'h0};
  assign RDATA        = rdata_q;
endmodule

A real product RN-I adds credit counters per channel, retry on RetryAck, and a proper outstanding-txn scoreboard — but this sketch fits on one page and exposes where the work is.


Interview-Ready Takeaways

  • "Why does CHI exist if ACE exists?" → ACE is signal-level; scales to ~8 coherent masters. CHI is packet-based; scales to 128+ by using a home-based directory instead of broadcast snoop.
  • "What are the four CHI channels?" → REQ, RSP, SNP, DAT. Each credit-flow-controlled, independently ordered.
  • "What's an HN-F?" → Home Node (Fully-coherent): owns a slice of the physical address space, holds a directory + System Level Cache slice, serialises coherent transactions for that slice.
  • "What is DMT?" → Direct Memory Transfer: HN-F asks SN-F to send data straight to the requester, bypassing HN-F as a relay. Saves hops on cold misses.
  • "What is DCT?" → Direct Cache Transfer: on a peer hit, the snoop responder forwards data directly to the new requester instead of via HN-F.
  • "Where does the System Level Cache live?" → Distributed across HN-F tiles; each tile holds a slice. Address-hashed.
  • "What does MPAM do?" → Per-transaction PARTID lets the interconnect enforce SLC way allocations and memory bandwidth shares per partition (cloud tenant / VM).
  • "Why CHI-E?" → Added RME / Realm attributes so CCA (confidential compute) can enforce its four-way security state end-to-end through the interconnect.

References

Arm Ltd. — AMBA 5 CHI Architecture Specification (IHI 0050), Issues A through F
Arm Ltd. — Arm CoreLink CMN-600, CMN-650, CMN-700 Technical Reference Manuals
Arm Ltd. — Arm Neoverse N1 / N2 / V1 / V2 Software Optimization Guides — mesh traffic and SLC hit-rate tuning
Arm Ltd. — Arm MPAM Architecture Specification and the CHI carrier definitions for MPAM
Arm Ltd. — Arm Realm Management Extension (RME) Specification and the CHI-E Realm annex
Biswas, A. et al. — "CMN-700: A Mesh Network for Next-Generation Arm Servers" — Hot Chips 2022 tutorial
Alves, A. et al. — AWS Graviton architecture — AWS re:Invent 2020 / 2023 tech deep dives
NVIDIA — NVIDIA Grace CPU Architecture Whitepaper — 72-core CMN-700 topology
Owens, J. et al. — SystemC / gem5 mesh models used in academic CHI research
Wikipedia — "Coherent Hub Interface" and "Arm CoreLink" — cross-references

Presentation built with Reveal.js 4.6 · Playfair Display + DM Sans + JetBrains Mono
Educational use.