CHI is a layered, packet-based coherent protocol with four message channels, typed node roles, and a home-based directory — designed to scale from 2 to 128+ coherent agents without changing the protocol.
| Node | Role | Example |
|---|---|---|
| RN-F | Request Node, Fully-coherent — has caches, issues coherent reqs | Cortex-A / Neoverse CPU cluster |
| RN-I | Request Node, I/O-coherent — no caches, so never snooped | Non-coherent DMA, NIC, AXI bridge |
| RN-D | Request Node with DVM — I/O-coherent plus DVM participation | SMMU (MMU-600 / MMU-700) |
| HN-F | Home Node, Fully-coherent — directory + System Level Cache slice | CMN-700 HN-F tile |
| HN-I | Home Node, I/O — gateway to AXI I/O region | CMN-700 HN-I tile |
| SN-F | Slave Node, Fully-coherent — memory endpoint | DDR / HBM controller, CXL.mem |
| MN | Misc Node — broadcasts, DVM sync, etc. | Central management block |
In ACE the coherency logic sat inside the CCI. In CHI it lives in the distributed HN-F tiles. Each HN-F owns a slice of the physical address space; all coherent requests for addresses in its slice flow through it.
tag         : address hash slice
shared_vec  : N-bit mask of RNs holding the line
dirty       : {clean | dirty-elsewhere | none}
SLC state   : which way of the System Level Cache holds the data
pending_txn : outstanding-request FSM pointer
The HN-F knows from its directory that only RN-F B holds the line, so it sends a single targeted snoop — not a broadcast. That's CHI's scaling trick.
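The directory lookup behind that targeted snoop fits in a few lines of Python. This is a toy model: the field names (`shared_vec`, `dirty`) follow the sketch above, and the encodings are illustrative, not the CMN implementation.

```python
# Toy HN-F directory: decide which RN-Fs must be snooped for a line.
def snoop_targets(directory, line_addr, requester):
    """Return the set of RN-F IDs to snoop (targeted, never broadcast)."""
    entry = directory.get(line_addr)
    if entry is None:                      # line not tracked -> no sharers
        return set()
    sharers = {rn for rn in range(len(entry["shared_vec"]))
               if entry["shared_vec"][rn]}
    return sharers - {requester}           # never snoop the requester itself

# Only RN-F B (id 1) holds the line -> exactly one targeted snoop.
directory = {0x8000: {"shared_vec": [0, 1, 0, 0], "dirty": "dirty-elsewhere"}}
print(snoop_targets(directory, 0x8000, requester=0))   # {1}
print(snoop_targets(directory, 0x9000, requester=0))   # set()
```

A broadcast protocol would have to ask all N peers; the directory reduces that to the (usually tiny) sharer set.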
| SnpResp | Meaning |
|---|---|
| SnpResp_I | Line not present / Invalid at responder |
| SnpResp_SC | Responder keeps a SharedClean copy |
| SnpResp_UC | Responder had UC → now I (line moved) |
| SnpResp_SD | Responder keeps SharedDirty |
| SnpRespData_* | Same as above + includes cache line data (DAT channel) |
| SnpRespData_I_PD | Line dropped & dirty forwarded ("passed dirty") |
| RetryAck | (Requests, not snoops) Completer lacks resources now; the RN retries after a PCrdGrant credit |
When a snoop responder has UD (dirty) and is being asked to drop the line (e.g. SnpUnique), it sends SnpRespData_I_PD — "here is the data, it's dirty, it's now your responsibility to write back."
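A simplified picture of how a responder picks its reply to SnpUnique, keyed on its local cache state. One legal choice per state is shown; the spec permits other encodings, and Data Pull / forwarding variants are ignored here.

```python
# Illustrative mapping: local CHI cache state -> SnpUnique reply.
# SnpUnique forces the responder to I; dirty data must travel with
# the response ("passed dirty", the _PD suffix).
def snp_unique_response(state):
    if state == "I":
        return ("SnpResp_I", None)                 # nothing held
    if state in ("SC", "UC"):
        return ("SnpResp_I", None)                 # clean copy: just invalidate
    if state in ("SD", "UD"):
        return ("SnpRespData_I_PD", "line data")   # dirty: hand over data + WB duty
    raise ValueError(f"unknown state {state}")

print(snp_unique_response("UD"))   # ('SnpRespData_I_PD', 'line data')
print(snp_unique_response("SC"))   # ('SnpResp_I', None)
```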
On a mesh, going through HN-F means 2× the hop count. DMT/DCT cuts cache-line latency by ~30–40% for the common cases (miss-to-memory, and contended read from peer cache).
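Counting message legs makes the saving concrete. A sketch only: real latency also depends on mesh distance, SLC pipeline depth, and the CompAck leg, which is omitted here.

```python
# Message legs for a read miss to memory, with and without
# DMT (Direct Memory Transfer). Each leg is one mesh traversal.
baseline = ["RN->HN (ReadNoSnp)", "HN->SN (read)",
            "SN->HN (CompData)", "HN->RN (CompData)"]
with_dmt = ["RN->HN (ReadNoSnp)", "HN->SN (read)",
            "SN->RN (CompData, direct)"]   # data skips the return hop via HN-F

print(len(baseline), len(with_dmt))   # 4 3
```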
A cache-line address is hashed to decide which HN-F owns its slice — typically a hash of address bits to avoid stripe patterns. This balances both cache capacity and traffic across the mesh.
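A minimal version of that slice hash. The XOR-fold below is illustrative; CMN's actual hash function is configuration-specific.

```python
# Map a physical address to one of n_hnf home nodes by XOR-folding
# the cache-line-granule address bits, so consecutive lines and
# power-of-two strides both spread across HN-Fs.
def hnf_for_addr(addr, n_hnf=8, line_bits=6):
    line = addr >> line_bits            # drop offset-within-line bits
    h = 0
    while line:
        h ^= line & (n_hnf - 1)         # fold next bit group (n_hnf a power of 2)
        line >>= n_hnf.bit_length() - 1
    return h

# Consecutive 64 B lines land on different HN-Fs.
print([hnf_for_addr(a) for a in (0x0000, 0x0040, 0x0080, 0x00C0)])   # [0, 1, 2, 3]
```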
| Issue | Year | Key adds |
|---|---|---|
| CHI-A | 2013 | Original — Cortex-A57, CCN-504 |
| CHI-B | 2014 | Atomics, a range of CMO ops |
| CHI-C | 2016 | Extended QoS, exclusive ops |
| CHI-D | 2018 | MPAM, stashing, persistent-memory hooks |
| CHI-E | 2021 | RME / CCA Realms, Partial-line (UCE) |
| CHI-F | 2023 | CHI-C2C chiplet extensions, UCIe interop hooks |
Older CHI-A masters can talk to a CHI-E home (new features simply not used). The protocol negotiates capabilities at interface bring-up so old IP keeps working.
| IP | Year | Max mesh | Peak RN-F |
|---|---|---|---|
| CCN-504/508/512 | 2013–16 | ring | 16 |
| CMN-600 | 2016 | 8×8 | 64 |
| CMN-650 | 2020 | 8×8 | 128 with MPAM |
| CMN-700 | 2022 | 12×12 | 128+ with RME |
| CMN S3 | 2024 | larger mesh + CHI-C2C | chiplet-ready |
Every Neoverse server SoC (N1, N2, V1, V2) uses one of these. AWS Graviton, Ampere Altra, NVIDIA Grace, Microsoft Cobalt — all CMN-based.
Credit-based flow control is the standard solution for lossless, bounded-buffer interconnects (InfiniBand, PCIe, modern NoCs). It avoids the deadlock possibilities of store-and-forward while keeping buffers small.
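A sender-side link-credit counter in a few lines — a sketch of the general scheme, not CHI's exact L-credit/flit encoding.

```python
# Credit-based flow control: the sender holds one credit per free
# buffer slot at the receiver. No credit -> no flit -> no overflow,
# so links are lossless with small, bounded buffers.
class CreditedSender:
    def __init__(self, receiver_slots):
        self.credits = receiver_slots      # receiver advertises its buffer depth

    def try_send(self, flit):
        if self.credits == 0:
            return False                   # hold the flit locally; nothing is dropped
        self.credits -= 1                  # one credit spent per flit sent
        return True

    def credit_return(self):
        self.credits += 1                  # receiver freed a slot

tx = CreditedSender(receiver_slots=2)
print(tx.try_send("f0"), tx.try_send("f1"), tx.try_send("f2"))   # True True False
tx.credit_return()
print(tx.try_send("f2"))                                         # True
```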
In CHI, DVM messages travel as ordinary transactions (DVMOp) rather than a special AC-channel encoding as in ACE: the requester sends a DVMOp REQ to a Misc Node (MN) — the DVM serialisation point — which fans it out to every participating RN-F and RN-D. At 128 RN-Fs, a single TLBI IS that must reach every peer is expensive, so Neoverse-class systems pipeline DVM hard — multiple in-flight DVMOps can overlap, and a DVM Sync only blocks on the ones that matter.
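The "only block on the ones that matter" point can be sketched as an MN-side scoreboard. This structure is hypothetical; a real MN also orders DVMOps against each other.

```python
# Toy MN scoreboard: DVMOps (e.g. TLBIs) fan out to all RN-Fs and
# complete independently; a DVM Sync waits only for the DVMOps that
# were outstanding when it was issued, not for later ones.
class DvmScoreboard:
    def __init__(self):
        self.outstanding = set()
    def issue(self, op_id):
        self.outstanding.add(op_id)        # fan-out to RN-Fs not modelled here
    def complete(self, op_id):
        self.outstanding.discard(op_id)    # last snoop ack received for this op
    def sync_barrier(self):
        return set(self.outstanding)       # the set this Sync must drain

mn = DvmScoreboard()
mn.issue("TLBI#1"); mn.issue("TLBI#2")
barrier = mn.sync_barrier()                # DVM Sync issued at this point
mn.issue("TLBI#3")                         # later op: the Sync need not wait
mn.complete("TLBI#1"); mn.complete("TLBI#2")
print(barrier - {"TLBI#1", "TLBI#2"})      # set() -> the Sync can complete
```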
CPU-side partitioning (pinning cores, LLC ways) isn't enough — the interconnect itself has contended resources (SLC capacity, SN-F queues). Carrying the MPAM PARTID on every transaction lets each shared resource enforce its own policy, exposed to software through a resctrl-like interface.
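The SLC side of that enforcement can be sketched as a per-PARTID way mask. Way-masking is a real MPAM resource control; the Python structure and PARTID values here are illustrative.

```python
# MPAM-style cache partitioning: each PARTID may allocate only into
# the SLC ways its mask permits, so a noisy tenant cannot evict a
# latency-critical partition's working set.
WAYS = 16
way_mask = {
    0: 0xFFFF,   # default PARTID: all 16 ways
    7: 0x000F,   # constrained PARTID: ways 0-3 only
}

def allocatable_ways(partid):
    mask = way_mask.get(partid, way_mask[0])   # unknown PARTIDs fall to default
    return [w for w in range(WAYS) if (mask >> w) & 1]

print(len(allocatable_ways(0)), allocatable_ways(7))   # 16 [0, 1, 2, 3]
```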
Monitor firmware at EL3 programs the GPT (Granule Protection Tables) that enforce which physical memory belongs to which world, while the Realm Management Monitor (RMM), running at R-EL2, manages the Realms themselves. The interconnect (HN-F / SMMU / memory controller) enforces those tables at every transaction.
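Conceptually, the per-transaction check is a lookup of the granule's owning world against the access's security state. A toy model: real GPT walks are multi-level and cached in a Granule Protection Cache, and the granule map below is invented.

```python
# Toy Granule Protection Table check: physical address -> owning
# world; a transaction is allowed only if its security state may
# touch that world's memory.
GRANULE = 4096
gpt = {0x0: "non-secure", 0x1: "realm", 0x2: "root"}   # granule index -> world

def gpt_check(paddr, access_world):
    owner = gpt.get(paddr // GRANULE, "non-secure")
    if access_world == "root":
        return True                       # root (EL3) may access all worlds
    return owner == access_world          # otherwise worlds are disjoint

print(gpt_check(0x1000, "realm"))         # True  (realm granule, realm access)
print(gpt_check(0x1000, "non-secure"))    # False (granule protection fault)
```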
| Aspect | ACE | CHI |
|---|---|---|
| Layer | Signal-level | Packet-based |
| Scope | ≤ ~8 masters | 128+ coherent nodes |
| Snoop model | Broadcast (filter-assisted) | Home-based directory |
| Topology | Arbiter / small crossbar | Any NoC (ring / mesh) |
| Data flow | Always via CCI | DMT / DCT direct paths |
| Cache levels | Up to L3 shared | Full System Level Cache |
| Atomics | ACE5 only | From CHI-B |
| Partitioning | None (per-master QoS hints only) | MPAM |
| Security | TrustZone only | RME / CCA Realms |
Arm publishes formal assertion libraries for every CHI interface. SoC houses plug these in at every port (RN-F boundary, HN-F boundary, SN-F boundary) and run 24-hour formal proofs to cover protocol deadlock and data integrity.
A non-coherent CHI Request Node (no caches, no SNP channel participation) wrapped around an AXI4 master is the simplest agent that can plug into a CHI mesh — useful for DMA and legacy IP integration.
// Sketch: CHI RN-I built as a thin state-machine
// in front of an existing AXI4 master.
// No caches; never receives SNP on the RN-I port.
module chi_rni_axi_bridge (
  input  logic         clk, rstn,
  // --- CHI RN-I ports (to interconnect) ---
  input  logic         chi_req_credit,   // L-credit from XP
  output logic         chi_req_flitv,
  output logic [63:0]  chi_req_flit,     // opcode + addr + TxnID
  input  logic         chi_rsp_flitv,
  input  logic [31:0]  chi_rsp_flit,     // DBIDResp / Comp
  input  logic         chi_dat_flitv,
  input  logic [255:0] chi_dat_flit,     // DAT payload
  // --- AXI4 slave port (to the non-coherent master) ---
  input  logic         ARVALID, output logic ARREADY,
  input  logic [31:0]  ARADDR,
  output logic         RVALID,  input  logic RREADY,
  output logic [31:0]  RDATA,   output logic RLAST
  // ... AW/W/B tied off for brevity
);
  typedef enum logic [1:0] {IDLE, REQ, WAIT, RSP} st_t;
  st_t st, nxt;
  logic [7:0]  txnid;
  logic [31:0] araddr_q, rdata_q;

  always_ff @(posedge clk or negedge rstn)
    if (!rstn) begin
      st <= IDLE; txnid <= '0;
    end else begin
      st <= nxt;
      if (st == IDLE && ARVALID)        araddr_q <= ARADDR;             // latch addr at AR handshake
      if (st == WAIT && chi_dat_flitv)  rdata_q  <= chi_dat_flit[31:0]; // latch single-beat data
      if (st == REQ  && chi_req_credit) txnid    <= txnid + 8'd1;       // fresh TxnID per issued flit
    end

  always_comb begin
    nxt = st; chi_req_flitv = 1'b0; ARREADY = 1'b0;
    RVALID = 1'b0; RLAST = 1'b0;
    unique case (st)
      IDLE: if (ARVALID) begin ARREADY = 1'b1; nxt = REQ; end  // accept the AR beat
      REQ : if (chi_req_credit) begin
              chi_req_flitv = 1'b1; nxt = WAIT;                // spend one L-credit per flit
            end
      WAIT: if (chi_dat_flitv) nxt = RSP;                      // CompData arrives on DAT
      RSP : begin
              RVALID = 1'b1; RLAST = 1'b1;
              if (RREADY) nxt = IDLE;
            end
    endcase
  end

  assign chi_req_flit = {8'h04 /*ReadNoSnp*/, araddr_q, txnid, 16'h0};
  assign RDATA = rdata_q;
endmodule
A real product RN-I adds credit counters per channel, retry on RetryAck, and a proper outstanding-txn scoreboard — but this sketch fits on one page and exposes where the work is.
Arm Ltd. — AMBA 5 CHI Architecture Specification (IHI 0050), Issues A through F
Arm Ltd. — Arm CoreLink CMN-600, CMN-650, CMN-700 Technical Reference Manuals
Arm Ltd. — Arm Neoverse N1 / N2 / V1 / V2 Software Optimization Guide — mesh traffic and SLC hit rate tuning
Arm Ltd. — Arm MPAM Architecture Specification and CHI carrier definitions for MPAM
Arm Ltd. — Arm Realm Management Extension (RME) Specification and CHI-E Realm annex
Biswas, A. et al. — "CMN-700: A Mesh Network for Next-Generation Arm Servers" — HotChips 2022 tutorial
Alves, A. et al. — AWS Graviton architecture — AWS re:Invent 2020 / 2023 tech deep dives
NVIDIA — NVIDIA Grace CPU Architecture Whitepaper — 72-core CMN-700 topology
Owens, J. et al. — SystemC / gem5 mesh models used in academic CHI research
Wikipedia — "Coherent Hub Interface" and "Arm CoreLink" — well-sourced cross-references
Presentation built with Reveal.js 4.6 · Playfair Display + DM Sans + JetBrains Mono
Educational use.