ARM AMBA · PRESENTATION 06

The Future of AMBA

Chiplets · CXL Coexistence · AI Accelerators · Confidential Compute
CHI-C2C · UCIe · MPAM · RME / CCA · CXL.mem · AI / NPU / HBM · Formal verification
02

The Macro Picture — Where AMBA is Going

  • AMBA's trajectory over the past 5 years is no longer about new protocols — it's about new contexts:
    • Interconnect crosses die boundaries (chiplets).
    • Shared memory with non-Arm actors (CXL / accelerators).
    • Hard partitioning and confidentiality (MPAM / RME).
    • Pulling formal verification into the protocol spec itself.
  • Each trend creates new AMBA 5 revisions — no clean-sheet AMBA 6 yet.
  • The instruction-set side (Armv9) and the interconnect side (CHI-E/F) co-evolve: features like Realms require end-to-end attribute propagation.

The four forces

  • Disaggregation — the SoC becomes multiple chiplets with coherent memory.
  • Heterogeneity — CPUs, NPUs, GPUs, and DSAs all need a coherent memory fabric.
  • Security — RME / CCA adds a fourth security state on every transaction.
  • Partitioning — MPAM makes the interconnect a managed shared resource.
03

Chiplets — the Problem Statement

  • Reticle-limited dies (~800 mm²) cap the number of cores on a single piece of silicon.
  • Yields fall rapidly beyond 400 mm². A 72-core monolithic Neoverse V2 would be uneconomic.
  • Answer: multiple smaller chiplets (compute, I/O, memory) assembled on an interposer or packaging substrate.
  • But chiplet-to-chiplet links are nothing like on-die wires:
    • 10–50× the energy per bit (2–5 pJ vs 0.1 pJ).
    • 2–4× the latency (ns vs sub-ns).
    • Link errors and retry needed (BER 1e-12 not 1e-18).
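Why retry logic is mandatory off-die falls out of simple arithmetic. A hedged sketch (the flit size is an assumption for illustration; the BER values come from the bullet above, not from any spec):

```python
def flit_error_prob(ber: float, flit_bits: int) -> float:
    """Probability that at least one bit in a flit is corrupted."""
    return 1.0 - (1.0 - ber) ** flit_bits

def effective_bandwidth(raw_gbps: float, ber: float, flit_bits: int) -> float:
    """Goodput after retrying corrupted flits (simple stop-and-wait model)."""
    return raw_gbps * (1.0 - flit_error_prob(ber, flit_bits))

# A 256-bit flit: chiplet-link BER (1e-12) vs on-die BER (1e-18)
p_c2c = flit_error_prob(1e-12, 256)    # ~2.6e-10: rare, but must be handled
p_ondie = flit_error_prob(1e-18, 256)  # ~2.6e-16: effectively never happens
```

The retry overhead is tiny in bandwidth terms; the cost is the link-layer machinery (CRC, replay buffers) that on-die wires never needed.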

UCIe — the standard pipe

Universal Chiplet Interconnect Express (UCIe) — an open specification from the UCIe Consortium (founding promoters include Intel, AMD, Arm, TSMC, Samsung, ASE, Qualcomm, Google Cloud, Meta, and Microsoft) — defines:

  • Standard (2D) and Advanced (2.5D/3D) packages
  • Link speeds 4–32 GT/s per lane
  • Bandwidth density of roughly 1 TB/s per mm of die edge (advanced package)
  • Protocol layer pluggable: PCIe, CXL, or CHI
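Back-of-envelope for the bandwidth-density claim above (the lane density figure is an assumption for illustration, not a UCIe spec value):

```python
def edge_bandwidth_gbps(lanes_per_mm: float, gt_per_s: float, edge_mm: float) -> float:
    """GB/s across a chiplet edge: each lane carries gt_per_s Gbit/s (1 bit per transfer)."""
    return lanes_per_mm * edge_mm * gt_per_s / 8.0

# Assumed advanced-package density of ~250 lanes/mm at 32 GT/s:
# a 10 mm edge yields ~10,000 GB/s, the ">10 TB/s between chiplets" scale
bw = edge_bandwidth_gbps(lanes_per_mm=250, gt_per_s=32, edge_mm=10)
```

The point of the arithmetic: density per mm, not per-lane speed, is what makes advanced packaging worthwhile.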
04

CHI-C2C — Coherency over Chiplets

  • Arm's answer to chiplet coherency is CHI-C2C (Chip-to-Chip) — introduced in AMBA 5 CHI Issue F (2023+).
  • Preserves CHI's protocol-layer semantics across a UCIe or equivalent physical link.
  • Adds:
    • Link-layer retry & CRC (for lossy physical links)
    • Credit-aware C2C flow control
    • Time-stamped ordering domains so a coherency domain can span chiplets
    • Encryption and integrity for inter-chiplet traffic (important for CCA)
  • Lets a single coherent memory system span multiple compute chiplets and multiple memory chiplets.
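The credit-aware flow control listed above can be sketched in a few lines. This is an illustrative model of the mechanism, not the CHI-C2C link layer itself:

```python
class CreditLink:
    """Minimal credit-based flow control: the receiver grants credits up front
    and the sender may only transmit while it holds one, so buffers never overflow."""
    def __init__(self, credits: int):
        self.credits = credits
        self.delivered = []

    def send(self, flit) -> bool:
        if self.credits == 0:
            return False           # back-pressure: sender stalls, nothing is dropped
        self.credits -= 1
        self.delivered.append(flit)
        return True

    def credit_return(self, n: int = 1):
        self.credits += n          # receiver freed buffer space

link = CreditLink(credits=2)
assert link.send("A") and link.send("B")
assert not link.send("C")          # out of credits: stall, not loss
link.credit_return()
assert link.send("C")
```

Credits replace the on-die assumption that a READY wire can be sampled in the same cycle; across a retimed chiplet link, that assumption no longer holds.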
[Diagram: multi-chiplet coherent SoC. Two compute chiplets (16 × Neoverse V2 each, on a CMN-700 mesh) connect via CHI-C2C to an I/O chiplet (PCIe/CXL, management CPU) and a memory chiplet (HBM3 stacks behind a CHI SN-F).]
05

UCIe — What's Inside

  • Physical layer — AFE + clock forwarding. Lane-per-mm density so a chiplet edge can host hundreds of lanes.
  • Die-to-die adapter — CRC, scrambler, retry, link training. Provides a lossless byte stream.
  • Protocol layer — UCIe 1.0 standardises PCIe and CXL mappings plus a raw streaming mode. UCIe 1.1 / 2.0 extend streaming so CHI can be carried, mapping REQ/RSP/SNP/DAT onto UCIe flits.
  • Advanced package (TSMC CoWoS / Intel EMIB / ASE FOCoS) gives sub-mm trace lengths and very low capacitance — enabling >10 TB/s between chiplets.

Why Arm cares

Without an open chiplet standard, every customer builds a custom interposer with a proprietary protocol. With UCIe + CHI-C2C, an Arm compute chiplet can drop into any UCIe-compliant package and get coherent memory for free.

NVIDIA's Grace CPU uses the proprietary NVLink-C2C today, but public roadmap commentary suggests future non-NVIDIA Arm server parts will target UCIe + CHI-C2C.
06

CXL — Coexistence, not Replacement

  • CXL (Compute Express Link) is the CXL Consortium's coherency and memory protocol, layered on PCIe physical infrastructure. Three sub-protocols:
    • CXL.io — basically PCIe enumeration & I/O
    • CXL.cache — device can cache host memory coherently
    • CXL.mem — device exposes memory to host, host treats it as near-memory
  • AMBA role: an HN-I node on CMN-700 bridges to PCIe/CXL controllers; a CXL.mem expander appears as an SN-F-equivalent at the bridge.
  • On-chip coherency stays CHI; off-package coherency (across CXL) uses CXL's own ordering + snoop protocol. The bridge translates between them.

When CXL, when CHI-C2C?

CHI-C2C: inside a single package, Arm-Arm, minimum latency, maximum bandwidth, full coherency semantics. Think "one logical CPU made of chiplets".

CXL: between packages across a board, mixed vendors, tiered memory. Think "pooled memory and accelerators".

An Arm server SoC in 2025 may use both: CHI-C2C between its compute chiplets, CXL 3.x to external memory expanders or accelerators.
07

AI & Accelerators — Pulling AMBA to the Edge

  • Most modern AI accelerators and DPUs (NVIDIA BlueField, Rebellions ATOM, Tenstorrent parts, and the NPUs in countless Arm SoCs) expose AMBA interfaces internally:
    • NPU tensor core → AXI5 or CHI-compatible to the fabric
    • DMA engines → ACE5-Lite for coherent scatter/gather
    • Tensor unit micro-controller → AHB-Lite for config
  • HBM3 / HBM3e controllers are SN-F nodes delivering 819 GB/s per stack — the AMBA fabric must not become the bottleneck.
  • AMBA 5 cache stashing (NIC → L3) is critical for latency on real-time inference: the model weights and recent activations must land in CPU/NPU cache without cache-coherency round trips.

AI-specific AMBA extensions

  • Scatter-gather DMA — AXI4 INCR bursts with descriptor chains, common for KV-cache page-ins.
  • Cache stashing targets — NPU writes results directly to the CPU cache where the next kernel will read them.
  • MPAM PARTID — model-serving tenants get guaranteed memory bandwidth.
  • Atomic 64-bit FP (AXI5 + Armv9) — lock-free gradient aggregation.
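A scatter-gather descriptor chain is essentially a linked list that the DMA engine walks, issuing one burst sequence per fragment. The field layout below is hypothetical, for illustration only, not any real DMA engine's format:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Descriptor:
    """Hypothetical scatter-gather descriptor (illustrative field layout)."""
    addr: int                             # start address of this fragment
    length: int                           # bytes to move, issued as INCR bursts
    next: Optional["Descriptor"] = None   # link to the next fragment, or None

def walk_chain(head: Descriptor) -> list[tuple[int, int]]:
    """Walk the chain the way a DMA engine would, collecting (addr, length) pairs."""
    bursts, d = [], head
    while d is not None:
        bursts.append((d.addr, d.length))
        d = d.next
    return bursts

# Page in two non-contiguous KV-cache pages with a single chained descriptor
chain = Descriptor(0x8000_0000, 4096, Descriptor(0x8040_0000, 4096))
assert walk_chain(chain) == [(0x8000_0000, 4096), (0x8040_0000, 4096)]
```

One doorbell write kicks off the whole chain; software never touches the individual fragments again.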
08

MPAM — Partitioning the Cloud

  • MPAM — Memory-system Partitioning and Monitoring — is the single biggest AMBA-land innovation for hyperscalers.
  • Every transaction carries a PARTID + PMG; the HN-F, memory controller, SMMU, and I/O bridges all enforce per-PARTID policies.
  • Enforceable resources:
    • System Level Cache way / capacity quota
    • Memory bandwidth fraction
    • PCIe / CXL link shares
  • Monitorable: miss rates, traffic, latency histograms by PARTID.
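The enforcement idea reduces to a per-PARTID budget check at each resource. This is an illustrative model of the concept, not the MPAM register interface:

```python
class BandwidthPartitioner:
    """Sketch of per-PARTID bandwidth enforcement: each PARTID gets a fraction
    of the link's byte budget per accounting window (illustrative only)."""
    def __init__(self, window_bytes: int, fractions: dict[int, float]):
        self.budget = {pid: int(window_bytes * f) for pid, f in fractions.items()}
        self.used = {pid: 0 for pid in fractions}

    def admit(self, partid: int, nbytes: int) -> bool:
        if self.used[partid] + nbytes > self.budget[partid]:
            return False            # throttle this PARTID; others are unaffected
        self.used[partid] += nbytes
        return True

# Tenant 1 gets 20% of a 1000-byte window, tenant 2 gets 80%
bw = BandwidthPartitioner(1000, {1: 0.2, 2: 0.8})
assert bw.admit(1, 200) and not bw.admit(1, 1)   # tenant 1 is capped
assert bw.admit(2, 800)                          # tenant 2 keeps its share
```

The hardware equivalent sits in the HN-F and memory controller, so a noisy PARTID is throttled before it ever queues behind a well-behaved one.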

Without MPAM, a noisy neighbour VM can slash a latency-critical service by 30%+. With MPAM (on CMN-650/700), QoS becomes a scheduler decision.

Kubernetes & MPAM

Linux resctrl is gaining MPAM support for Arm (carried out of tree for years, being upstreamed in the 6.x series). A container / pod can request "5% LLC, 20% mem BW" and the kernel programs the per-PARTID limits across the CMN.

Monitoring example: a cloud monitoring stack can surface per-tenant LLC miss rates on Graviton by reading MPAM counters — the Arm counterpart to Intel's CAT/MBM (RDT) on x86.
09

RME & CCA — Confidential Compute Arm-Style

  • Realm Management Extension (RME) — Armv9.2 — adds a 4-way security state on every memory transaction: Root / Realm / Secure / Non-Secure.
  • AMBA carries this via AxNSE (Non-Secure Extended) on AXI5 / ACE5 / CHI-E.
  • Every HN-F consults a Granule Protection Table (GPT) for each transaction — only compatible state combinations pass.
  • A compromised hypervisor cannot read Realm memory: the interconnect enforces the isolation.
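The GPT check reduces to a lookup-and-compare per transaction. A deliberately simplified model (the real GPT has further encodings such as all-access and invalid granules; the addresses here are arbitrary):

```python
# The four physical address spaces (PAS) a transaction can carry under RME
ROOT, REALM, SECURE, NON_SECURE = "root", "realm", "secure", "non_secure"

# Granule Protection Table: granule base address -> owning PAS (illustrative)
gpt = {0x8000_0000: REALM, 0x8000_1000: NON_SECURE}

def gpc_allows(txn_pas: str, granule_addr: int) -> bool:
    """Granule protection check: Root may access anything; otherwise the
    transaction's PAS must match the granule's owner in the GPT."""
    owner = gpt[granule_addr]
    return txn_pas == ROOT or txn_pas == owner

# A hypervisor issuing Non-secure transactions cannot touch Realm memory:
assert not gpc_allows(NON_SECURE, 0x8000_0000)
assert gpc_allows(REALM, 0x8000_0000)
```

Because this check runs at the HN-F, the isolation holds even against privileged software: the transaction is rejected by the interconnect, not by the CPU.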

The CCA stack

  • Hardware — Armv9 RME + CHI-E / AXI5
  • Firmware — Realm Management Monitor (RMM) running at R-EL2; TF-A Monitor at EL3 managing the Root world
  • Guest OS — Linux with RSI (Realm Services Interface)
  • Hypervisor — KVM, modified to create and schedule Realms but unable to read their memory
Comparable to: Intel TDX, AMD SEV-SNP. Arm's CCA differs in that the hardware enforcement is pushed deep into the interconnect (HN-F GPT checks), not just the CPU.
10

AMBA-over-PCIe & Virtualised I/O

  • PCIe endpoints don't speak AMBA natively. On an Arm SoC, a PCIe endpoint is fronted by a bridge that translates PCIe TLPs into AXI5/ACE5-Lite transactions.
  • For CXL, an endpoint can be I/O-coherent (CXL.cache) or memory-coherent (CXL.mem). The bridge maps to ACE5-Lite or to an HN-F-like coherent gateway accordingly.
  • Virtualised I/O:
    • PCI-SIG SR-IOV / ATS / PASID integrate with Arm's SMMU via RN-D nodes.
    • Every I/O transaction carries a StreamID (for SMMU translation) and a PASID (for process isolation), both expressed as AMBA 5 sideband fields.

Example — NIC on BlueField

A NIC pipe on a BlueField DPU drives an internal AXI5 port with StreamID/PASID tagged on every packet. The SMMU resolves the translation; the CMN delivers to the correct coherent memory region; MPAM ensures tenant isolation; RME isolates Realm traffic.

In short: the bus-level work the industry used to do on I/O fencing has migrated into AMBA's sideband fields — StreamID, PASID, PARTID, NSE.
11

Formal Verification Goes First-Class

  • AMBA specifications now ship with formal property libraries as a companion deliverable.
  • Arm's ABVIP (AMBA Assertion-Based Verification IP) provides SystemVerilog Assertions (SVA) for every transaction type, channel interaction, and ordering rule.
  • Commercial EDA (Cadence Jasper, Siemens Questa, Synopsys VC Formal) consume these directly.
  • Outcome: a modern SoC team proves protocol compliance formally, and restricts simulation to system-level scenarios.

Why protocols can be formally verified at all

AMBA protocols are built around bounded handshakes (VALID/READY) and small numbers of outstanding transactions per ID. The state space is finite per interface — perfect for model-checking.
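To see why the state space is tractable, here is a toy exhaustive check of one classic VALID/READY rule (once VALID is asserted, it must stay asserted until READY accepts). The brute-force enumeration stands in for what a model checker does with SVA properties; it is a sketch, not a real protocol checker:

```python
from itertools import product

def handshake_ok(valid_trace, ready_trace) -> bool:
    """Property: once VALID is asserted it must remain high until READY accepts."""
    pending = False
    for v, r in zip(valid_trace, ready_trace):
        if pending and not v:
            return False            # VALID dropped before the handshake completed
        pending = v and not r       # still waiting for READY next cycle?
    return True

# Exhaustively enumerate every 4-cycle VALID/READY trace: 2^8 = 256 cases.
# The state space is tiny, which is exactly why these protocols model-check well.
violations = [(v, r)
              for v in product([0, 1], repeat=4)
              for r in product([0, 1], repeat=4)
              if not handshake_ok(v, r)]
assert violations  # the illegal traces are precisely what an SVA property flags
```

A formal tool does the same enumeration symbolically, over all reachable states rather than fixed-length traces.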

Emerging trend: AMBA 5 refreshes now include machine-readable versions of protocol rules — allowing direct translation to formal tool input without human interpretation.
12

Trends Worth Watching

On-die Optical

Silicon photonics at the chiplet edge — replacing UCIe electrical lanes with optical ones. AMBA is physical-layer-agnostic, so CHI-C2C already fits.

CHI for accelerators (RN-I-FC)

Future CHI revisions may introduce an "accelerator Request Node" that holds caches but declines full RN-F snooping — tradeoff for NPUs that want partial coherency.

Near-memory compute via CXL 3.x

CXL.mem expanders with embedded ALUs can offload aggregation. AMBA bridges need semantics for "command-at-address" transactions — may spawn AXI5+ compute-ops.

Automotive Functional Safety

AMBA in ISO 26262 ASIL-D systems — dual-lockstep coherency, ECC everywhere, safety-island partitions. Drives new AxUSER fields for integrity tokens.

Post-quantum Security

AMBA-level encryption of C2C links and memory traffic using PQC primitives (Kyber, Dilithium). Adds latency & area to HN-F / memory controllers.

RISC-V TileLink convergence?

Unlikely in the short term. Commercial RISC-V cores still expose AXI5 externally because the IP ecosystem demands it. TileLink remains an internal RISC-V interconnect choice.

13

What a 2028-class Arm SoC Might Look Like

[Diagram: hypothetical Neoverse "V-next" reference system. Three compute chiplets (A and B: 32 × V-next cores each on a CMN S3 mesh with MPAM and RME, CHI-C2C edges on three sides; C: NPU tiles on ACE5-Lite), an HBM4 memory chiplet (SN-F, 4 stacks, ~2 TB/s), a DDR5 / MCRDIMM chiplet (12 channels, CHI SN-F), and an I/O chiplet (PCIe 6.0 / CXL 3.x / NIC), all on a UCIe 2.0 package substrate with a silicon-photonics option. Every chiplet-to-chiplet edge runs CHI-C2C over UCIe physical lanes.]
14

Threats to AMBA's Dominance

  • RISC-V growth: if open-cored systems become majority shipment in a market segment (MCU, FPGA soft-core, specific embedded), TileLink could displace AMBA there — but the installed AMBA IP base is enormous.
  • CXL as a successor? CXL.cache + CXL.mem can do much of what CHI does. If CXL wins at the node level, AMBA may retreat inside each chip. But CXL operates at PCIe-class latency (tens of ns) vs CHI's nanosecond-scale on-die hops.
  • Apple: Apple's private interconnect (Ultra Fusion) is not AMBA. If every hyperscaler follows, Arm loses interconnect revenue — though not interconnect standardisation.
  • Intel & AMD: x86 interconnects (UPI, IF) remain closed. Not a displacement threat; a separate ecosystem.

Why AMBA will probably dominate through 2030

  • Ecosystem inertia — every EDA tool, every VIP, every verification methodology is AMBA-native.
  • Arm's aggressive spec refreshes — new features (RME, MPAM, CHI-C2C) keep the protocol in the lead.
  • Open licensing — nothing stops a RISC-V company from shipping AMBA externally (many do).
  • Network effects on verification — new protocols are priced out by the cost of equivalent VIP maturity.
15

Second-Order Effects

  • Cloud scheduling becomes interconnect-aware — MPAM + CHI observability lets schedulers optimise for cache and bandwidth footprint per tenant.
  • OS memory managers gain new primitives: Realm creation, Realm memory zeroisation, per-Realm TLBI via CHI DVM.
  • Fault models shift: hardware-confidential compute means kernel bugs cannot leak Realm data, which simplifies some security guarantees and complicates others.
  • Debug: CoreSight (ATB — the AMBA trace bus) itself gains security attributes, since trace streams in a Realm system must not be readable across states.

Workload influence on architecture

LLM inference is reshaping every number in the SoC roadmap — KV cache residence times, NPU-to-memory bandwidth ratios, near-memory computation, tensor-parallel all-reduce patterns. AMBA evolves to serve these, not abstract benchmarks.

Concrete example: Arm Neoverse V3 (announced 2024) reportedly doubles the SLC per tile to accommodate LLM KV cache working sets.
16

A Speculative AMBA 6?

  • There is no Arm announcement of AMBA 6. But if it arrives, it will likely unify:
    • CHI-C2C as first-class (no longer a CHI issue, but a distinct layer)
    • Optical transport (silicon photonics) as a physical-layer variant
    • Near-memory compute semantics (compute-at-slave operations)
    • Post-quantum cryptographic sidebands for C2C encryption
    • Deeper integration with CXL 3.x / 4.x coherency semantics
  • Or, as has happened since AMBA 3, the "AMBA 5" umbrella keeps stretching indefinitely.

How Arm usually versions

AMBA 2 to AMBA 3 was a true break (ASB-era semantics → AXI channels). AMBA 3 to AMBA 4 was extension. AMBA 4 to AMBA 5 was extension of scale (CHI for >16 nodes).

AMBA 6 — if it arrives — will probably be triggered by a similar discontinuity, likely package-level interconnect becoming as important as on-die.

17

What Engineers Should Learn Next

If you're a SoC architect

  • Read the CHI-E spec cover to cover (free download).
  • Study CMN-700 topology, mapping, and SLC hashing.
  • Build a pen-and-paper model of mesh traffic for a target workload.
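A pen-and-paper mesh model can start from average hop count. The sketch below computes mean Manhattan distance under uniform random traffic, a deliberate first-order assumption that ignores SLC hashing, hotspots, and locality:

```python
def mean_hops(n: int) -> float:
    """Average Manhattan distance between two uniformly random nodes of an
    n x n mesh: a first-order latency model for CMN-style mesh traffic."""
    total = hops = 0
    for sx in range(n):
        for sy in range(n):
            for dx in range(n):
                for dy in range(n):
                    total += 1
                    hops += abs(sx - dx) + abs(sy - dy)
    return hops / total

# Doubling the mesh side roughly doubles average hop count, so per-hop
# latency and bisection bandwidth both become first-class design constraints.
assert mean_hops(4) == 2.5
```

Multiplying mean hops by per-hop latency and injected traffic per node gives a surprisingly useful first estimate of mesh load before any simulation.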

If you're a verification engineer

  • Get hands-on with Jasper or VC Formal on AXI/CHI protocol properties.
  • Run litmus tests (herd7) on Armv9 to understand consistency.

If you're a systems engineer

  • Experiment with MPAM / resctrl on a Graviton or Ampere system.
  • Read the RMM (Realm Management Monitor) design docs when CCA becomes broadly available.

If you're a chiplet / package engineer

  • Read UCIe 2.0 and CHI-C2C specs side-by-side.
  • Study Intel EMIB, TSMC CoWoS-L, ASE FOCoS packaging choices and their bandwidth/energy numbers.
18

The Big-Picture Summary

  • AMBA started in 1996 as a piece of technical diplomacy — an open on-chip bus so Arm's licensees could integrate their own IP around Arm cores.
  • Twenty-eight years later it is the most deployed on-chip interconnect standard in history, dwarfing every alternative.
  • Its 2025 evolution is being driven by four converging pressures: chiplets, AI accelerators, confidential compute, and hyperscaler partitioning.
  • Arm's response is layered: CHI for system-scale coherency, CHI-C2C for chiplets, MPAM for partitioning, RME for confidential state, AxUSER/PARTID for the sideband metadata the modern system needs.
"The interconnect is no longer a piece of plumbing. It's where confidentiality is enforced, where tenants are isolated, where chiplets are stitched together, and where the semantics of your compute platform live. AMBA is quietly becoming the operating contract of modern silicon." — Industry observation, 2023
19

Interview-Ready Takeaways

  • "What is CHI-C2C and why does it exist?" → A CHI transport over UCIe or equivalent die-to-die links so multiple chiplets can form one coherent memory domain. Exists because reticle-limited monolithic dies no longer scale to 100+ cores.
  • "How does CXL relate to AMBA?" → CXL is across-package; AMBA is on-package / on-die. Bridges translate CXL.mem into CHI SN-F-style access; CXL.cache into ACE5-Lite-style coherency.
  • "What's MPAM solving?" → Noisy-neighbour effects in multi-tenant systems. Per-transaction PARTID gives the interconnect fine-grained enforcement of cache / bandwidth shares.
  • "What does RME add to AMBA?" → A 4-way security state on every transaction (Root / Realm / Secure / Non-Secure) + GPT checks at the HN-F. Implements Arm Confidential Compute in hardware at interconnect level.
  • "Will RISC-V kill AMBA?" → Unlikely. RISC-V cores ship AXI5/CHI interfaces externally because the ecosystem demands it. TileLink remains an internal-only choice for some RISC-V designs.
  • "What's the next protocol discontinuity?" → Most likely package-level interconnect — optical C2C and disaggregated memory pools — triggering either an AMBA 6 or a major CHI-F+ extension.
20

References

Arm Ltd. — AMBA 5 CHI Architecture Specification (IHI 0050), Issues E and F (RME, CHI-C2C)
Arm Ltd. — Arm Confidential Compute Architecture (CCA) Whitepaper
Arm Ltd. — Arm Realm Management Extension (RME) Specification
Arm Ltd. — Arm MPAM Architecture Specification
UCIe Consortium — UCIe 1.1 / 2.0 Specifications, available at uciexpress.org
CXL Consortium — Compute Express Link Specification 3.0 / 3.1, available at computeexpresslink.org
Arm Ltd. — Neoverse V3 / N3 announcements (2024), Arm Tech Day presentations
Biswas, A. et al. — "CMN-700 & Beyond", HotChips 34 / 35 tutorials
SemiAnalysis, Linley Group — ongoing reporting on chiplet economics and the AMBA ecosystem
OpenCompute Project — Chiplet Design Exchange (CDX) working group reports on interoperability across UCIe and CHI-C2C
Wikipedia — "UCIe", "Compute Express Link", "Arm Confidential Compute Architecture" — well-sourced cross-references

Presentation built with Reveal.js 4.6 · Playfair Display + DM Sans + JetBrains Mono
Educational use.