ARM AMBA · PRESENTATION 05

CHI — Coherent Hub Interface

Packet-Based Coherency for 128+ Nodes · The Protocol Behind Neoverse
REQ · RSP · SNP · DAT · RN-F · HN-F · SN-F · CMN-600/650/700 · CHI-A/B/C/D/E/F

Why CHI?

  • ACE was a signal-level protocol — every master port physically owned AC/CR/CD wires running to the interconnect.
  • At 2 clusters: fine. At 8 clusters: the snoop fan-out starts to hurt. At 32+ coherent agents: the wire count becomes unmanageable.
  • For servers and high-end SoCs Arm needed a transport-agnostic, packet-based protocol: any coherent message becomes a packet that the interconnect routes over whatever topology fits (ring, mesh, hybrid).
  • First delivered as AMBA 5 CHI Issue A in 2013 alongside Arm's CCN-504 interconnect; refined through CHI-B/C/D/E/F as Neoverse and CMN-600/650/700 scaled up.

CHI in one sentence

CHI is a layered, packet-based coherent protocol with four message channels, typed node roles, and a home-based directory — designed to scale from 2 to 128+ coherent agents without changing the protocol.

The specification (IHI 0050) is deliberately layered: a transport-agnostic protocol layer sits above the link-layer flit rules — you can implement CHI over any NoC that preserves per-channel ordering.

CHI Node Types

Node  | Role                                                                 | Example
RN-F  | Request Node, Fully coherent — has caches, issues coherent requests  | Cortex-A / Neoverse CPU cluster
RN-I  | Request Node, I/O-coherent — no caches, never snooped                | Non-coherent DMA, NIC, AXI bridge
RN-D  | Request Node with DVM — I/O-coherent plus DVM participation          | SMMU (MMU-600 / MMU-700)
HN-F  | Home Node, Fully coherent — directory + System Level Cache slice     | CMN-700 HN-F tile
HN-I  | Home Node, I/O — gateway to an AXI I/O region                        | CMN-700 HN-I tile
SN-F  | Slave Node, Fully coherent — memory endpoint                         | DDR / HBM controller, CXL.mem
MN    | Misc Node — DVM serialisation, broadcasts                            | Central management block

The Home Node is the star

In ACE the coherency logic sat inside the CCI. In CHI it lives in the distributed HN-F tiles. Each HN-F owns a slice of the physical address space; all coherent requests for addresses in its slice flow through it.

This address-hash-to-home mapping is what turns CHI from "snoop everyone" into "ask the home, which knows who to ask."
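The hash-to-home mapping can be sketched as a small software model. The XOR-fold below is purely illustrative (the production CMN hash function is Arm-proprietary); what matters is that every agent computes the same home for the same line, and power-of-two strides don't all land on one HN-F:

```python
def home_node(addr: int, n_hnf: int) -> int:
    """Map a cache-line address to its owning HN-F slice.

    XOR-folds the line-address bits so consecutive lines (and
    power-of-two strides) spread across all homes instead of
    striping onto one. Illustrative only; the real CMN hash
    is different. n_hnf must be a power of two here.
    """
    line = addr >> 6                    # 64 B cache-line granule
    h = 0
    while line:
        h ^= line & (n_hnf - 1)         # fold successive bit groups
        line >>= (n_hnf - 1).bit_length()
    return h

# Consecutive lines spread across all 64 homes:
assert len({home_node(i * 64, 64) for i in range(64)}) == 64
# A 4 KB-strided stream still avoids striping onto home 0:
assert home_node(0x1000, 64) != home_node(0x0, 64)
```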

The Four Message Channels

  • Every CHI interface (a port between two nodes, with independent transmit and receive links) carries four virtual channels:
    • REQ — request from RN / HN (ReadShared, ReadUnique, WriteBack, CMO, Atomic…)
    • RSP — responses (CompAck, RetryAck, DBIDResp, DVMResp…)
    • SNP — snoops directed from HN to RN-F
    • DAT — cache-line data (bursts)
  • Each channel has its own credit-based flow control — a producer must hold a credit for the target's buffer before sending.
  • Packets are carried as flits — small fixed-size chunks that each fit in one NoC beat. A full 64 B cache line is typically 4 × 16 B data flits on a 128-bit link.
[Diagram: a CHI link between RN-F and HN-F carrying the four virtual channels — REQ, RSP, SNP, and DAT in each direction (HN→RN and RN→HN).]
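The per-channel credit discipline can be modelled in a few lines of software (a behavioural sketch, not RTL; the class name is illustrative):

```python
class ChannelLink:
    """One CHI virtual channel with credit-based flow control.

    The receiver grants one credit per free buffer slot; the
    sender may only transmit while it holds a credit. Nothing
    is ever dropped: a sender without credits simply stalls.
    """
    def __init__(self, buffer_slots: int):
        self.credits = buffer_slots      # all slots start free
        self.buffer = []

    def try_send(self, flit) -> bool:
        if self.credits == 0:
            return False                 # stall: no credit held
        self.credits -= 1
        self.buffer.append(flit)
        return True

    def consume(self):
        """Receiver drains a flit and returns its credit."""
        flit = self.buffer.pop(0)
        self.credits += 1
        return flit

req = ChannelLink(buffer_slots=2)
assert req.try_send("ReadShared") and req.try_send("ReadUnique")
assert not req.try_send("WriteBack")     # stalled, not dropped
req.consume()                            # credit returned to sender
assert req.try_send("WriteBack")
```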

CHI Cache States

  • CHI inherits the five-state MOESI-variant from ACE, with slightly different names and tighter specification:
    • UC — UniqueClean (E)
    • UD — UniqueDirty (M)
    • SC — SharedClean (S)
    • SD — SharedDirty (O)
    • I — Invalid
  • Additionally CHI-E introduced the partial-line concept (UCE — Unique Clean Empty, used when a line is reserved for write-allocate but not yet populated).
  • The HN-F's directory entry stores coarse state per line: who holds it, and what overall shared/unique state the line is in — enough to determine which snoops to fan out.

Directory entry (per line, per HN-F)

tag           : address hash slice
shared_vec    : N-bit mask of RNs holding it
dirty         : {clean | dirty-elsewhere | none}
SLC state     : which way of the System
                Level Cache holds the data
pending_txn   : outstanding request FSM ptr
A cache-coherent CHI system is effectively a set of directories at the HN-Fs, each tracking what its address slice is cached as across the RN-Fs.
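The directory's core job — turning a request into the minimal snoop set — can be sketched as follows (field names mirror the illustrative entry above and are not the spec's):

```python
def snoop_targets(shared_vec: int, requester: int, n_rn: int) -> list:
    """Return the RN-F IDs the HN-F must snoop for a coherent
    request, given the directory's holder bitmask. Only actual
    holders (minus the requester itself) are snooped; there is
    never a broadcast to all n_rn agents."""
    return [rn for rn in range(n_rn)
            if (shared_vec >> rn) & 1 and rn != requester]

# Line held only by RN-F 5 (bit 5 set); RN-F 0 requests it:
assert snoop_targets(0b100000, requester=0, n_rn=128) == [5]
# Nobody holds the line: zero snoops, go straight to memory:
assert snoop_targets(0, requester=0, n_rn=128) == []
```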

Mesh Topology — CMN-700

[Diagram: CMN-700-style mesh NoC — each tile holds an RN-F (CPU), HN-F (home slice), and/or SN-F (memory port); edge tiles host RN-I (NIC), RN-D (SMMU), HN-I (PCIe), and SN-F (DDR / HBM) ports. XY routing between tiles; each tile has a Cross-Point (XP) with credit-managed CHI channels.]

Request Flow — Read Miss

Scenario: CPU (RN-F A) misses; the address hashes to HN-F@(2,2); the line is cached at RN-F B in UD.

  1. REQ — RN-F A → HN-F: ReadShared (TxnID=7)
  2. SNP — HN-F → RN-F B: SnpShared
  3. DAT — RN-F B → HN-F: SnpRespData_I_PD (dirty passed)
  4. DAT — HN-F → RN-F A: CompData_SD (CPU A → SharedDirty)
  5. RSP — RN-F A → HN-F: CompAck (optional WriteBack to SN-F deferred)

The HN-F knows from its directory that only RN-F B holds the line, so it sends a single targeted snoop — not a broadcast. That's CHI's scaling trick.


Snoop Response Types

SnpResp           | Meaning
SnpResp_I         | Line not present / Invalid at responder
SnpResp_SC        | Responder keeps a SharedClean copy
SnpResp_UC        | Responder had UC → now I (line moved)
SnpResp_SD        | Responder keeps SharedDirty
SnpRespData_*     | Same as above + includes cache-line data (DAT channel)
SnpRespData_I_PD  | Line dropped & dirty forwarded ("passed dirty")
RetryAck          | Responder can't accept the snoop now; try later

"Passed Dirty" (PD)

When a snoop responder has UD (dirty) and is being asked to drop the line (e.g. SnpUnique), it sends SnpRespData_I_PD — "here is the data, it's dirty, it's now your responsibility to write back."

The HN-F may hold the dirty data in its System Level Cache (SLC) without immediately writing it back — saves DRAM traffic for popular lines.
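The HN-F's side of a passed-dirty response can be sketched as below (class and field names are illustrative; deferring the write-back until SLC eviction is one common policy, as described above):

```python
class SlcLine:
    """One line slot in an HN-F's System Level Cache slice."""
    def __init__(self):
        self.data, self.dirty = None, False

def accept_snp_resp_data_i_pd(slc_line: SlcLine, data: bytes):
    """HN-F receives SnpRespData_I_PD: the snooped RN-F dropped
    the line and passed dirty ownership. The HN-F absorbs the
    data into its SLC slice and marks it dirty; the write-back
    to the SN-F is deferred until the SLC evicts the line."""
    slc_line.data = data
    slc_line.dirty = True            # HN-F now owns the write-back

line = SlcLine()
accept_snp_resp_data_i_pd(line, b"\x00" * 64)
assert line.dirty and line.data is not None   # no DRAM write yet
```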

DMT & DCT — Data Movement Optimisations

DMT — Direct Memory Transfer

  • For plain misses where no cache holds the line, HN-F asks SN-F (DRAM) to send data directly to the requester (RN-F), bypassing the HN-F's critical path.
  • HN-F stays in the loop for tracking (response still comes back), but the data flit does not have to hop through it.

DCT — Direct Cache Transfer

  • When a peer RN-F has the data (Shared or Unique), HN-F can ask that peer to forward it directly to the requester, again bypassing HN-F as a relay.
  • Critical for cache-to-cache latency on contended lines.

Why this matters

On a mesh, going through HN-F means 2× the hop count. DMT/DCT cuts cache-line latency by ~30–40% for the common cases (miss-to-memory, and contended read from peer cache).

DCT makes CHI a true distributed-directory protocol — the HN-F arbitrates, but doesn't always shepherd data through itself.
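The hop saving is easy to quantify on an XY-routed mesh with Manhattan distances. This is a back-of-envelope model with a deliberately unfavourable home placement, not CMN timing:

```python
def hops(a, b):
    """XY-routing hop count between two mesh tiles (x, y)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

# Requester, home, and memory port at three corners of an 8x8 mesh:
rn, hn, sn = (0, 0), (7, 0), (0, 7)

# The command path is the same either way: RN -> HN -> SN.
# Data path without DMT: SN -> HN -> RN (relayed through the home).
relayed_data = hops(sn, hn) + hops(hn, rn)
# Data path with DMT: SN -> RN directly.
direct_data = hops(sn, rn)

assert relayed_data == 21 and direct_data == 7   # 3x fewer data hops here
```

The saving depends on placement: if the home happens to sit on the shortest RN–SN path, relaying costs nothing extra, which is why the quoted ~30–40% is an average over common cases, not a worst case.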

The System Level Cache (SLC)

  • Each HN-F tile in CMN-600/650/700 includes a slice of the System Level Cache — a distributed L3/L4 shared across all RN-Fs.
  • Functions:
    • Reduce DRAM traffic (hit in SLC = no SN-F access).
    • Act as a snoop-filter-equivalent — if a line is in SLC, track who also holds it.
    • Hold dirty data temporarily until an eviction forces write-back.
  • Total SLC in a Neoverse V2 system can be ≥ 128 MB — 2 MB per HN-F × 64 HN-Fs on an 8×8 mesh.

Address hashing

A cache-line address is hashed to decide which HN-F owns its slice — typically a hash of address bits to avoid stripe patterns. This balances both cache capacity and traffic across the mesh.

Neoverse-V2 systems can configure the SLC as either traditional cache or as a scratchpad/pinned-region for latency-critical data (via MPAM region attributes). The interconnect is no longer just a transport — it's a managed shared resource.

CHI Issue Timeline

Issue  | Year | Key adds
CHI-A  | 2013 | Original — Cortex-A57, CCN-504
CHI-B  | 2014 | Atomics, extended set of CMO ops
CHI-C  | 2016 | Extended QoS, exclusive ops
CHI-D  | 2018 | MPAM, stashing, persistent-memory hooks
CHI-E  | 2021 | RME / CCA Realms, partial-line state (UCE)
CHI-F  | 2023 | CHI-C2C chiplet extensions, UCIe interop hooks

Each issue is backwards-compatible

Older CHI-A masters can talk to a CHI-E home (the new features simply go unused). Interface capabilities are fixed as configuration properties at integration time, so old IP keeps working.

This matters commercially: a CPU licensee can tape out a CHI-D RN-F and pair it with a CHI-E CMN-700 without rewiring either.

CMN-600 / 650 / 700 — Arm's Interconnect IP

IP               | Year    | Max mesh              | Peak RN-F
CCN-504/508/512  | 2013–16 | ring                  | 16
CMN-600          | 2016    | 8×8                   | 64
CMN-650          | 2020    | 8×8                   | 128, with MPAM
CMN-700          | 2022    | 12×12                 | 128+, with RME
CMN S3           | 2024    | larger mesh + CHI-C2C | chiplet-ready

Every Neoverse server SoC (N1, N2, V1, V2) uses one of these. AWS Graviton, Ampere Altra, NVIDIA Grace, Microsoft Cobalt — all CMN-based.

Scale headline numbers

  • CMN-700 on an 8×8 mesh: 64 tiles, each typically hosting 2 RN-F + 2 HN-F ports — hundreds of coherent participants.
  • SLC capacity: up to 512 MB distributed.
  • Peak bandwidth: >1 TB/s to memory across 8 DDR channels + HBM3 stacks.
NVIDIA Grace uses CMN-700 with 72 Neoverse V2 cores; AWS Graviton 4 uses it with 96 cores. The same IP at different tile counts.

Retry & Credits

  • Every CHI channel has a finite buffer at the receiver.
  • Senders operate with credits: you don't send a packet unless you hold a credit for its channel at the target. Credits are returned after the packet is consumed.
  • If the sender has no credit available, the packet waits. No packet is dropped.
  • For request channels, a target that cannot accept right now responds with RetryAck; later it issues a PCrdGrant. The requester then re-sends the request carrying the granted credit type (PCrdType). Bounded retry — no livelock.

Why credits?

Credit-based flow control is the standard solution for lossless, bounded-buffer interconnects (InfiniBand, PCIe, modern NoCs). It avoids the deadlock possibilities of store-and-forward while keeping buffers small.

CHI-E added stronger ordering-domain definitions so credits never cross ordering-domain boundaries — important for CCA (Realms).
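A minimal behavioural model of the retry flow on the request channel (opcode names from the spec; the FIFO queueing of waiters and the class shape are illustrative):

```python
class HomeNode:
    """Models a target with a bounded request tracker. When full
    it answers RetryAck; when a slot frees it hands out a
    PCrdGrant, and the requester re-sends carrying that
    protocol credit. No request is ever dropped."""
    def __init__(self, slots: int):
        self.free, self.waiters = slots, []

    def request(self, rn, with_pcredit=False) -> str:
        if with_pcredit or self.free > 0:
            if not with_pcredit:        # a granted credit pre-reserved a slot
                self.free -= 1
            return "Comp"               # accepted and tracked
        self.waiters.append(rn)
        return "RetryAck"               # tracker full: bounded retry

    def release(self):
        """A tracker slot frees up: grant a protocol credit if
        someone is waiting, otherwise reclaim the slot."""
        if self.waiters:
            return ("PCrdGrant", self.waiters.pop(0))
        self.free += 1
        return None

hn = HomeNode(slots=1)
assert hn.request("RN0") == "Comp"
assert hn.request("RN1") == "RetryAck"         # tracker full
assert hn.release() == ("PCrdGrant", "RN1")    # slot freed, credit granted
assert hn.request("RN1", with_pcredit=True) == "Comp"
```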

DVM on CHI

  • DVM on CHI is a dedicated request opcode (DVMOp) rather than a special AC-channel encoding.
  • Flows:
    • An RN-F or RN-D (typically SMMU) sends a DVMOp REQ to a Misc Node (MN) — the DVM serialisation point.
    • MN forwards SnpDVMOp to every RN-F + RN-D in the DVM domain.
    • Each peer acknowledges; MN aggregates; requester gets final DVMComplete.
  • Payload: TLBI address range, ASID, VMID, IC invalidate, sync barrier.
  • CHI-E extended DVM to handle RME (Realm) invalidates separately.

DVM Scaling

At 128 RN-Fs, a single broadcast TLBI (Inner Shareable) that must fan out to every peer is expensive. Neoverse-class systems pipeline DVM aggressively — multiple in-flight DVMOps can overlap, and a DVM Sync only blocks on the ones that matter.

The Misc Node (MN) is the single serialisation point. Systems sometimes need multiple MNs for DVM scaling — the spec allows it.
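The MN's serialise, fan-out, and aggregate role can be sketched as one function (a behavioural model under simplifying assumptions: one DVMOp at a time, and "SnpResp" standing in for the actual RSP-channel acknowledgement):

```python
def dvm_broadcast(dvm_domain_peers: list, op) -> str:
    """Misc Node behaviour for one DVMOp: forward SnpDVMOp to
    every RN-F / RN-D in the DVM domain, collect every peer's
    acknowledgement, and only then return the final completion
    to the original requester."""
    acks = [peer(op) for peer in dvm_domain_peers]   # fan-out + gather
    assert all(a == "SnpResp" for a in acks)         # every peer answered
    return "DVMComplete"

# Eight peers that each invalidate their TLB entry and acknowledge:
peers = [lambda op: "SnpResp" for _ in range(8)]
assert dvm_broadcast(peers, ("TLBI", "ASID=7")) == "DVMComplete"
```

Real systems overlap many DVMOps in flight, which is exactly the pipelining the paragraph above describes.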

MPAM — Partitioning on CHI

  • MPAM — Memory System Resource Partitioning and Monitoring — is Arm's answer to CAT (Cache Allocation Technology) on x86.
  • CHI-D and CHI-E carry a PARTID on every transaction (plus PMG — Performance Monitoring Group).
  • HN-F tiles enforce per-PARTID policies:
    • SLC way allocation quotas
    • Memory-bandwidth fair shares
    • Miss-rate telemetry
  • Lets a hypervisor prevent a noisy-neighbour VM from thrashing the system cache or saturating memory bandwidth.
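The SLC way-allocation quota can be modelled as a bitmask check at fill time (a sketch with hypothetical names; real HN-F policy tables also cover bandwidth shares and monitoring):

```python
def allowed_ways(partid: int, policy: dict, n_ways: int) -> list:
    """Which SLC ways a fill with this PARTID may allocate into.
    The per-PARTID policy is a bitmask over the cache ways, so a
    noisy tenant is fenced into its quota of ways while reads
    can still hit anywhere."""
    mask = policy.get(partid, (1 << n_ways) - 1)   # default: all ways
    return [w for w in range(n_ways) if (mask >> w) & 1]

policy = {7: 0b0000_0011}                 # PARTID 7: only ways 0-1
assert allowed_ways(7, policy, 8) == [0, 1]
assert allowed_ways(3, policy, 8) == list(range(8))   # unrestricted
```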

Why this lives in the interconnect

CPU-side partitioning (pinning cores, LLC ways) isn't enough — the interconnect itself has contended resources (SLC, SN-F queues). Putting PARTID on the transaction lets every shared resource enforce policy.

Real deployment: cloud platforms can apply MPAM per container or VM so chatty tenants don't starve latency-critical ones; Linux exposes the controls through a resctrl-like interface.

CHI for Realms — CCA & RME

  • Arm Confidential Compute Architecture (CCA, Armv9.2) introduces Realms — hypervisor-unreadable virtual machines.
  • CHI-E adds attributes to support this:
    • 4-way security state per transaction — Root / Realm / Secure / Non-Secure.
    • Granular permission checks at the HN-F (Realm transactions can't read Secure memory, etc.).
    • SLC tagging ensures Realm lines aren't returned to Non-Secure readers.
    • DVM semantics extended to invalidate Realm TLB entries separately.

RMM on the critical path

The Monitor (EL3 firmware) programs the GPTs (Granule Protection Tables) that record which physical granules belong to which world; the Realm Management Monitor (RMM), running at Realm EL2, manages the Realms themselves. The interconnect (HN-F / SMMU / memory controller) enforces those tables at every transaction.

GPT checks: every cache-line access at the HN-F is gated by a per-granule permission table. This is how CCA protects Realms from a compromised hypervisor.
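The per-granule gate can be sketched as a table lookup on every access (world names from RME; the dict layout and the default for unmapped granules are illustrative, not the architected GPT format):

```python
def gpt_check(gpt: dict, addr: int, world: str) -> bool:
    """Gate an access by the Granule Protection Table: the 4 KB
    granule holding addr must be assigned to the accessor's
    world. Root may access everything; any other mismatch is a
    granule protection fault. Unmapped granules default to
    Non-Secure in this toy model."""
    granule = addr >> 12                       # 4 KB granules
    owner = gpt.get(granule, "Non-Secure")
    return world == "Root" or owner == world

gpt = {0x80000: "Realm"}                       # granule at PA 0x8000_0000
assert gpt_check(gpt, 0x8000_0040, "Realm")            # Realm reads its own memory
assert not gpt_check(gpt, 0x8000_0040, "Non-Secure")   # compromised hypervisor blocked
assert gpt_check(gpt, 0x8000_0040, "Root")
```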

CHI vs ACE — Side by Side

Aspect        | ACE                         | CHI
Layer         | Signal-level                | Packet-based
Scope         | ≤ ~8 masters                | 128+ coherent nodes
Snoop model   | Broadcast (filter-assisted) | Home-based directory
Topology      | Arbiter / small crossbar    | Any NoC (ring / mesh)
Data flow     | Always via CCI              | DMT / DCT direct paths
Cache levels  | Up to shared L3             | Full System Level Cache
Atomics       | ACE5 only                   | From CHI-B
Partitioning  | None (per-master QoS hint)  | MPAM
Security      | TrustZone only              | RME / CCA Realms

When to pick which

  • Mobile/embedded SoC with 2-cluster CPU: ACE (cheaper wires).
  • Server or datacentre SoC with 16+ CPUs: CHI (scales).
  • Mixed: CHI at system level, ACE bridges for legacy masters or cluster-local coherency.

Verification — Why CHI is Hard

  • Huge state space: 128+ RN-Fs, each with independent caches; each line has 5 states × N possible holders.
  • Packet reordering: NoC may reorder between channels.
  • Ordering rules are subtle: some transactions serialise at HN, others at RN; CompAck ordering vs DAT arrival; PD forwarding.
  • Modern CHI verification is a mix of:
    • Formal (Jasper / Questa) on protocol properties at each interface.
    • UVM constrained-random on the mesh-level fabric.
    • Emulation (Palladium / Veloce) for system-level boot on real software.
    • Post-silicon coherency stress tests (memory ordering litmus).

Arm ABVIP for CHI

Arm publishes formal assertion libraries for every CHI interface. SoC houses plug these in at every port (RN-F boundary, HN-F boundary, SN-F boundary) and run 24-hour formal proofs to cover protocol deadlock and data integrity.

Litmus testing: post-silicon, teams run tools such as litmus7 / herd7 (from the herdtools suite) to generate memory-ordering stress tests and run millions of iterations to catch coherency bugs that escaped verification.

Minimum-Viable CHI — and Why There Isn't One

Why CHI resists "minimal"

  • CHI is packet-based with 4 virtual channels + credit flow control + a directory.
  • The smallest compliant RN-F must: respond to every inbound SNP type, maintain per-line cache state, track outstanding TxnIDs, generate CompAck, and honour ordering domains. ~2–4 k flops before you add any actual caching.
  • There is no "just wires" CHI. The protocol's coherency contract forbids it.

What can be minimal: an RN-I-ish bridge

A non-coherent CHI Request Node (no caches, no SNP channel participation) wrapped around an AXI4 master is the simplest agent that can plug into a CHI mesh — useful for DMA and legacy IP integration.

// Sketch: CHI RN-I built as a thin state-machine
// in front of an existing AXI4 master.
// No caches; never receives SNP on the RN-I port.
module chi_rni_axi_bridge (
  input  logic        clk, rstn,
  // --- CHI RN-I ports (to interconnect) ---
  input  logic        chi_req_credit,   // credit from XP
  output logic        chi_req_flitv,
  output logic [63:0] chi_req_flit,     // opcode + addr + TxnID
  input  logic        chi_rsp_flitv,
  input  logic [31:0] chi_rsp_flit,     // DBIDResp / Comp
  input  logic        chi_dat_flitv,
  input  logic [255:0] chi_dat_flit,    // DAT payload
  // --- AXI4 slave port (to the non-coherent master) ---
  input  logic        ARVALID, output logic ARREADY,
  input  logic [31:0] ARADDR,
  output logic        RVALID,  input  logic RREADY,
  output logic [31:0] RDATA,   output logic RLAST
  // ... AW/W/B tied off for brevity
);
  typedef enum logic [1:0] {IDLE, REQ, WAIT, RSP} st_t;
  st_t st, nxt;
  logic [7:0]  txnid;
  logic [31:0] rdata_q;                // captured DAT payload

  always_ff @(posedge clk or negedge rstn)
    if (!rstn) begin
      st <= IDLE; txnid <= '0;
    end else begin
      st <= nxt;
      // One TxnID per issued flit (only when the flit actually goes out)
      if (st == REQ && chi_req_credit) txnid <= txnid + 1'b1;
      // Capture the data beat: chi_dat_flit is only valid this cycle
      if (chi_dat_flitv)               rdata_q <= chi_dat_flit[31:0];
    end

  always_comb begin
    nxt = st; chi_req_flitv = 1'b0;  ARREADY = 1'b0;
    RVALID = 1'b0; RLAST = 1'b0;
    unique case (st)
      IDLE: if (ARVALID)                 nxt = REQ;
      REQ : if (chi_req_credit) begin    // hold until a REQ credit is free
              chi_req_flitv = 1'b1;
              ARREADY       = 1'b1;      // complete the AR handshake as the flit issues
              nxt = WAIT;
            end
      WAIT: if (chi_dat_flitv)           nxt = RSP;
      RSP : begin RVALID = 1'b1; RLAST = 1'b1;
              if (RREADY)                nxt = IDLE;
            end
    endcase
  end

  assign chi_req_flit = {8'h00 /*ReadNoSnp*/,
                         ARADDR, txnid, 16'h0};
  assign RDATA        = rdata_q;
endmodule

A real product RN-I adds credit counters per channel, retry on RetryAck, and a proper outstanding-txn scoreboard — but this sketch fits on one page and exposes where the work is.


Interview-Ready Takeaways

  • "Why does CHI exist if ACE exists?" → ACE is signal-level; scales to ~8 coherent masters. CHI is packet-based; scales to 128+ by using a home-based directory instead of broadcast snoop.
  • "What are the four CHI channels?" → REQ, RSP, SNP, DAT. Each credit-flow-controlled, independently ordered.
  • "What's an HN-F?" → Home Node (Fully-coherent): owns a slice of the physical address space, holds a directory + System Level Cache slice, serialises coherent transactions for that slice.
  • "What is DMT?" → Direct Memory Transfer: HN-F asks SN-F to send data straight to the requester, bypassing HN-F as a relay. Saves hops on cold misses.
  • "What is DCT?" → Direct Cache Transfer: on a peer hit, the snoop responder forwards data directly to the new requester instead of via HN-F.
  • "Where does the System Level Cache live?" → Distributed across HN-F tiles; each tile holds a slice. Address-hashed.
  • "What does MPAM do?" → Per-transaction PARTID lets the interconnect enforce SLC way allocations and memory bandwidth shares per partition (cloud tenant / VM).
  • "Why CHI-E?" → Added RME / Realm attributes so CCA (confidential compute) can enforce its four-way security state end-to-end through the interconnect.

References

Arm Ltd. — AMBA 5 CHI Architecture Specification (IHI 0050), Issues A through F
Arm Ltd. — Arm CoreLink CMN-600, CMN-650, CMN-700 Technical Reference Manuals
Arm Ltd. — Arm Neoverse N1 / N2 / V1 / V2 Software Optimization Guides — mesh traffic and SLC hit-rate tuning
Arm Ltd. — Arm MPAM Architecture Specification and the CHI carrier definitions for MPAM
Arm Ltd. — Arm Realm Management Extension (RME) Specification and the CHI-E Realm annex
Biswas, A. et al. — "CMN-700: A Mesh Network for Next-Generation Arm Servers" — Hot Chips 2022 tutorial
Alves, A. et al. — AWS Graviton architecture — AWS re:Invent 2020 / 2023 tech deep dives
NVIDIA — NVIDIA Grace CPU Architecture Whitepaper — 72-core CMN-700 topology
Owens, J. et al. — SystemC / gem5 mesh models used in academic CHI research
Wikipedia — "Coherent Hub Interface" and "Arm CoreLink" — cross-references

Presentation built with Reveal.js 4.6 · Playfair Display + DM Sans + JetBrains Mono
Educational use.