ARM AMBA · PRESENTATION 04

ACE & ACE-Lite

Bringing Cache Coherency to AXI · Five Cache States, Three Snoop Channels

AC · CR · CD · UC / UD / SC / SD / I · ReadShared · ReadUnique · DVM · CCI-400/500/550


Why Coherency? Why AMBA?

By 2010, multi-core Arm SoCs (Cortex-A9 quad) had reached the point where software-managed cache maintenance was no longer acceptable.
The Linux kernel needed hardware cache coherency for SMP scheduling and page-table updates to work reasonably.
big.LITTLE (Cortex-A7 + A15, 2011) made it unavoidable — the OS had to migrate threads between clusters with live cache state.
Arm needed a bus protocol that could carry coherency primitives. Bolting them onto AXI gave ACE (AXI Coherency Extensions), which joined AXI4 in the AMBA 4 family in 2011.

ACE in one sentence

ACE adds three snoop channels to AXI and extends the transaction type space so a requesting cache can signal what it wants (Shared, Unique, etc.) and peer caches can report back what they have.

Arm publishes ACE alongside AXI in a single specification (IHI 0022). Every AXI signal still exists; ACE just layers on top.

MOESI — Refresher

Classical cache-coherency states, named after the five-state protocol popularised by AMD / SPARC:
- Modified — dirty, exclusive, must write back
- Owned — dirty, shared (this cache responsible for write-back)
- Exclusive — clean, only copy
- Shared — clean, may be copies elsewhere
- Invalid — not valid
ACE uses a variant with five equivalent states but renamed to fit AXI's semantics — UniqueClean / UniqueDirty / SharedClean / SharedDirty / Invalid.

ACE ↔ MOESI mapping

ACE	MOESI
UniqueClean (UC)	E
UniqueDirty (UD)	M
SharedClean (SC)	S
SharedDirty (SD)	O
Invalid (I)	I

The MOESI "O" (Owned) state — the one responsible for supplying clean copies to peers and eventually writing back — is SharedDirty in ACE-speak.

The Five ACE Cache States

UniqueClean

This cache holds the only copy. Memory is up to date. Equivalent to E.

UniqueDirty

Only copy; memory is stale. Must write back before eviction. Equivalent to M.

SharedClean

Possibly shared with peers. Memory may be up-to-date or another peer may own a dirty copy. Equivalent to S.

SharedDirty

Shared with peers; this cache owns the dirty copy and is responsible for write-back. Equivalent to O.

Invalid

Line not present / dropped.

Key invariants

At most one cache can be Unique (UC or UD) for a given line.
At most one cache can be SharedDirty for a given line.
Multiple caches can be SharedClean simultaneously.
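
The invariants above can be checked mechanically. The following is a minimal sketch, assuming a hypothetical helper `check_line` that receives the state each cache holds for one line; the class and function names are illustrative and do not come from the ACE specification.

```python
from enum import Enum

class AceState(Enum):
    UC = "UniqueClean"
    UD = "UniqueDirty"
    SC = "SharedClean"
    SD = "SharedDirty"
    I = "Invalid"

# ACE state -> MOESI letter, as in the mapping table.
ACE_TO_MOESI = {
    AceState.UC: "E", AceState.UD: "M",
    AceState.SC: "S", AceState.SD: "O", AceState.I: "I",
}

def check_line(states):
    """Return True if the per-cache states for one line obey the invariants."""
    unique = [s for s in states if s in (AceState.UC, AceState.UD)]
    shared_dirty = [s for s in states if s is AceState.SD]
    valid = [s for s in states if s is not AceState.I]
    if len(unique) > 1 or len(shared_dirty) > 1:
        return False
    # A Unique holder has the only copy, so every other cache must be Invalid.
    if unique and len(valid) > 1:
        return False
    return True
```

For example, `check_line([AceState.SD, AceState.SC, AceState.SC])` is accepted, while a Unique line next to any other valid copy is rejected.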

What's "Dirty"?

A cache line is Dirty if its data differs from the value held in main memory. Responsibility for the eventual write-back always rests with exactly one cache: the one holding the line in UD or SD.

ACE Channels — AXI Plus Three

On top of the five AXI channels (AR, R, AW, W, B), ACE adds three snoop channels:
- AC — Snoop Address (interconnect → cache)
- CR — Snoop Response (cache → interconnect)
- CD — Snoop Data (cache → interconnect, optional)
When a new request arrives at the interconnect, the Home logic broadcasts an AC snoop to all peer caches; each responds on CR (hit/miss and state flags) and, if it is supplying the line, drives the data on CD.
Plus a DVM logical channel (Distributed Virtual Memory) — piggybacks on the snoop channels for TLB invalidation, I-cache maintenance, and barrier messages.

ACE Transaction Types — Read Side

Transaction	When used	Finishing state
ReadOnce	One-shot snoop read; don't cache	(not cached)
ReadClean	Line fill expecting clean data only	UC or SC
ReadNotSharedDirty	Line fill, willing to accept any state except SD	UC / UD / SC
ReadShared	Read expecting to share it	UC/UD/SC/SD
ReadUnique	Read intending to modify — invalidate all peers	UC or UD
CleanUnique	Already have SC/SD, need to upgrade to unique before writing	UC (UD if the line was SD)
MakeUnique	Overwrite entire line — no data transfer needed	UD

Why so many?

Each type tells the coherency logic exactly what ownership and data transfer are needed — lets the interconnect do the minimum work. A cache-line upgrade (SC → UC) needs no data movement at all, only invalidation messages.

CleanUnique vs MakeUnique: CleanUnique keeps the current data valid (reading before writing); MakeUnique discards current data (full-line overwrite, like memset). MakeUnique saves bus bandwidth on whole-line writes.
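
The choice between these transactions can be sketched as a small decision function. This is a simplified illustration, not the policy of any particular CPU: `pick_store_transaction` is a hypothetical name, and a real cache controller may choose differently (for example, it may never issue MakeUnique).

```python
def pick_store_transaction(state, full_line_write):
    """Pick the transaction a cache might issue before performing a store.

    state is the line's current ACE state: "UC", "UD", "SC", "SD" or "I".
    Returns None when no bus transaction is needed.
    """
    if state in ("UC", "UD"):
        return None            # already unique: write locally
    if full_line_write:
        return "MakeUnique"    # whole line overwritten: no data transfer needed
    if state in ("SC", "SD"):
        return "CleanUnique"   # data already held: peers only need invalidating
    return "ReadUnique"        # no copy held: fetch data and invalidate peers
```

A partial store to a SharedClean line yields `"CleanUnique"`, which moves no data; a memset-style store to an Invalid line yields `"MakeUnique"`.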

ACE Transaction Types — Write Side

Transaction	Purpose
WriteUnique	Non-cached write; snoops others to invalidate first
WriteLineUnique	Full-line non-cached write
WriteBack	Evict a dirty line (UD or SD) to memory
WriteClean	Write dirty data to memory but keep the line cached (UD → UC, SD → SC)
Evict	Inform coherency that a clean line was dropped
WriteEvict	Evict a clean (UniqueClean) line and pass its data to a downstream cache

Why "Evict" (clean drop)?

A snoop filter inside a coherent interconnect needs to know which caches currently hold which lines. If a CPU silently drops a clean line, the filter's tracking diverges from reality. ACE provides Evict so that a master attached to a snoop filter can report clean drops and let the filter reclaim the entry.

WriteEvict is the clean counterpart of WriteBack: the line is not dirty, but its data is handed to a lower-level cache (for example a system-level cache) instead of being discarded. Eviction of a modified line uses WriteBack, which writes the data back and gives up the line in one transaction.

Snoop Transactions — from the Interconnect

When a ReadShared / ReadUnique / CleanUnique arrives at the interconnect, it broadcasts a snoop to every peer cache on their AC channel.
Snoop types mirror the request semantics:
- ReadOnce / ReadClean / ReadNotSharedDirty / ReadShared / ReadUnique — peer must supply data if present, and transition its state accordingly.
- CleanInvalid — peer must write back dirty data, then invalidate.
- MakeInvalid — peer drops the line without writing back (used when the requester intends to overwrite the whole line).
Peer responds with a 5-bit CRRESP[4:0] that carries data-transfer, error, pass-dirty, is-shared and was-unique flags.

CRRESP encoding

CRRESP[0] DataTransfer — CD active
CRRESP[1] Error         — snoop failed
CRRESP[2] PassDirty     — responder owns dirty
CRRESP[3] IsShared      — responder keeps shared
CRRESP[4] WasUnique     — responder was unique

A peer that's already Invalid for a line can respond with CRRESP=0 — "I have nothing." This is the common case; most snoops miss most peers.
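
The bit assignments above translate directly into a decoder. This is a small sketch for illustration; `SnoopResponse` and `decode_crresp` are hypothetical names, and only the bit positions come from the table.

```python
from typing import NamedTuple

class SnoopResponse(NamedTuple):
    data_transfer: bool   # CRRESP[0]
    error: bool           # CRRESP[1]
    pass_dirty: bool      # CRRESP[2]
    is_shared: bool       # CRRESP[3]
    was_unique: bool      # CRRESP[4]

def decode_crresp(crresp: int) -> SnoopResponse:
    """Split a 5-bit CRRESP value into its named flags."""
    if not 0 <= crresp < 32:
        raise ValueError("CRRESP is a 5-bit field")
    return SnoopResponse(*(bool((crresp >> bit) & 1) for bit in range(5)))
```

`decode_crresp(0)` returns all flags clear, which is the snoop-miss case described above.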

A Coherent Read — Full Flow

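Consider CPU0 issuing a ReadShared on AR for a line that CPU1 holds. The interconnect snoops CPU1 on AC; CPU1 hits, answers on CR with DataTransfer and IsShared set (PassDirty clear), and drives the line on CD. The interconnect forwards that data to CPU0 on R. CPU1 was UniqueDirty; after the snoop it becomes SharedDirty (it retains ownership of the dirty copy). CPU0 becomes SharedClean: it has the data, but responsibility for write-back stays with CPU1. No DRAM write was required.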
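
The flow described above can be modelled in a few lines. This is a behavioural sketch under stated assumptions: `read_shared` is a hypothetical function, each cache is a plain dictionary, and a dirty holder is assumed to keep write-back responsibility rather than pass it on (the protocol also permits passing it).

```python
def read_shared(requester, caches, memory, addr):
    """Model a ReadShared: snoop peers, update states, return the data.

    caches maps a CPU name to {addr: (state, data)}.
    """
    data = None
    shared = False
    for name, cache in caches.items():
        if name == requester or addr not in cache:
            continue                      # snoop miss for this peer
        state, value = cache[addr]
        # The snooped peer keeps its copy but can no longer be Unique.
        # Assumption: a dirty holder stays responsible (UD/SD -> SD).
        cache[addr] = ("SD" if state in ("UD", "SD") else "SC", value)
        data, shared = value, True
    if data is None:
        data = memory[addr]               # no peer held the line
    caches[requester][addr] = ("SC" if shared else "UC", data)
    return data
```

With CPU1 holding the line in UD, CPU0's read returns CPU1's data, leaves CPU1 in SD and CPU0 in SC, and never touches `memory`.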

AxDOMAIN & AxSNOOP

AxDOMAIN[1:0]

Controls the scope of snooping:

00 Non-shareable — no snoop needed
01 Inner Shareable — snoop within the Inner domain (typically a cluster)
10 Outer Shareable — snoop across the outer domain (multiple clusters, system cache)
11 System: spans every agent in the system; used for Device and Non-cacheable accesses, which are not snooped

AxSNOOP (ARSNOOP[3:0], AWSNOOP[2:0])

Encodes the specific transaction type (ReadOnce / ReadShared / CleanUnique / etc.). In ACE, ARSNOOP is 4 bits wide and AWSNOOP is 3 bits. Combined with AxDOMAIN, the interconnect knows exactly who to snoop and how.

Mapping from MMU to ACE

Armv8 page tables specify Shareability (Inner / Outer / None) per-region. The CPU combines that with the memory-type to produce AxDOMAIN + AxCACHE — the ACE interconnect then enforces the coherency domain.
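
As a rough illustration of that mapping, the sketch below converts the Armv8 descriptor shareability field SH[1:0] into an AxDOMAIN value. It is deliberately simplified: the function name `axdomain` is hypothetical, it assumes every Device or Non-cacheable access uses the System domain, and the exact behaviour of a real CPU is implementation-specific.

```python
# Armv8 translation-table SH[1:0] encodings (0b01 is reserved).
SH_NON, SH_OUTER, SH_INNER = 0b00, 0b10, 0b11

# AxDOMAIN[1:0] encodings from the list above.
DOMAIN_NON, DOMAIN_INNER, DOMAIN_OUTER, DOMAIN_SYSTEM = 0b00, 0b01, 0b10, 0b11

def axdomain(sh: int, cacheable: bool) -> int:
    """Map page-table shareability plus cacheability to AxDOMAIN."""
    if not cacheable:
        return DOMAIN_SYSTEM      # assumption: Device / Non-cacheable -> System
    mapping = {SH_NON: DOMAIN_NON, SH_INNER: DOMAIN_INNER, SH_OUTER: DOMAIN_OUTER}
    if sh not in mapping:
        raise ValueError("SH=0b01 is a reserved encoding")
    return mapping[sh]
```

Note that the two encodings do not line up bit for bit: Inner Shareable is `0b11` in the page table but `0b01` on the bus.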

Performance hint: marking a big DMA buffer Non-shareable avoids the snoops that a shareable mapping would trigger on every access. Some drivers map DMA memory this way, at the cost of handling cache maintenance in software.

ACE-Lite — I/O Coherency

Most peripherals aren't caches — they're producers/consumers of memory. They don't need to hold cache state, but they benefit enormously from the CPU's caches being coherent with their traffic.
ACE-Lite is a restricted ACE profile for such agents. Signals from the master side:
- AXI + AxDOMAIN + AxSNOOP (so the interconnect can snoop CPU caches).
- No AC / CR / CD channels on the master (it has no caches to snoop).
The master reads from / writes to memory, and the interconnect automatically snoops CPU caches to return fresh data or invalidate stale copies.
This is the coherent DMA / GPU / NIC / NPU interface.
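
What the interconnect does on behalf of an ACE-Lite master can be sketched as follows. This is a toy model, assuming hypothetical function names and treating each address as a single word so that a write replaces the whole value; a real interconnect would first clean a dirty line before merging a partial write.

```python
def io_coherent_read(addr, cpu_caches, memory):
    """ReadOnce from an ACE-Lite master: return the freshest copy, cache nothing.

    cpu_caches is a list of {addr: (state, data)} dictionaries.
    """
    for cache in cpu_caches:
        if addr in cache:
            return cache[addr][1]     # snoop hit: a CPU cache supplies the data
    return memory[addr]               # all snoops missed: read memory

def io_coherent_write(addr, value, cpu_caches, memory):
    """WriteUnique from an ACE-Lite master: invalidate CPU copies, then write."""
    for cache in cpu_caches:
        cache.pop(addr, None)         # stale copies are dropped
    memory[addr] = value
```

The master itself holds no state in this model, which is why nothing ever needs to snoop it.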

One-way coherency

An ACE-Lite master is coherency-transparent: the CPU stays coherent with it, but the master itself has no caches. No peer snoops the master — it would find nothing.

Cortex-A57 + Mali GPU over CCI-400 is the canonical example: the GPU uses ACE-Lite, so memory writes from CPUs are visible to the GPU without software cache maintenance.

DVM — Distributed Virtual Memory

Cache coherency is only half of SMP correctness — the other half is TLB coherency.
When one CPU invalidates a page-table entry (e.g. page removed, permission changed), every other CPU's TLBs must drop the corresponding entries before the OS can reuse the page.
DVM messages travel on the ACE snoop channels (AC, with a special SNOOP type) and carry:
- TLB Invalidate (TLBI) — by VA, by ASID, by all
- Branch Predictor Invalidate (BPI)
- Instruction-cache Invalidate (ICI)
- Sync — pipelined barrier
Every receiver must acknowledge; a DVM Sync waits for all acks before proceeding.
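
The invalidate-then-sync behaviour can be modelled very simply. The sketch below is an assumption-laden toy: `broadcast_tlbi` is a hypothetical name, each TLB is a Python set of `(asid, va)` pairs, and delivery is sequential, whereas real hardware pipelines the messages.

```python
def broadcast_tlbi(tlbs, asid, va):
    """Send a TLBI-by-VA to every participant, then check the Sync acks.

    tlbs maps a CPU name to its set of cached (asid, va) translations.
    Returns True once every participant has acknowledged.
    """
    acks = set()
    for cpu, tlb in tlbs.items():
        tlb.discard((asid, va))   # receiver drops the matching entry, if any
        acks.add(cpu)             # ...and acknowledges the message
    # The Sync completes only when all participants have acknowledged.
    return acks == set(tlbs)
```

Only after this returns may the OS safely reuse the page, since no CPU can still hold the stale translation.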

DVM + TLBI IS instructions

When software executes TLBI IS, ... on Armv8, the CPU issues a DVM TLBI to the interconnect, which broadcasts it to every Inner-Shareable participant. The subsequent DSB ISH waits on the DVM Sync acks.

DVM is often the bottleneck on workloads with heavy fork/mmap churn. Arm server chips put serious engineering into DVM fan-out so it doesn't serialise.

CCI-400 / CCI-500 / CCI-550

CCI	Year	ACE slave ports (clusters)	ACE-Lite slave ports
CCI-400	2011	2	3
CCI-500	2015	4	7, + snoop filter
CCI-550	2015	6	7, larger snoop filter

CCI = Cache Coherent Interconnect. Arm's off-the-shelf IP, used in many big.LITTLE and LITTLE-only SoCs between 2011 and 2016.

Why a snoop filter mattered

CCI-400 broadcast every snoop to every ACE master. With 2 clusters × 4 CPUs, fine. With 6–8 clusters, every unrelated miss generated up to 7 snoops, overwhelming the ACE links.

CCI-500 added a snoop filter, an inclusive directory of which caches hold which lines, so snoops are sent only to caches that may hold the line instead of being broadcast. A 95%+ snoop-reduction rate was typical.

CCI-550 was the last generation of CCI. Beyond 8 coherent masters, the broadcast architecture became uneconomical — which motivated the CHI transition.

Snoop Filters — How They Work

The snoop filter is a directory: for each cache line currently cached anywhere in the system, track which masters hold it.
Typically implemented as a set-associative tag structure, inclusive of the caches it tracks.
On a coherent request:
- Lookup the address in the filter.
- If no master holds it → skip snoop entirely, go direct to memory.
- If one or more masters hold it → snoop only those.
Filter capacity should be ≥ the sum of the tracked cache capacities. When the filter overflows, it must back-invalidate a line from one of the tracked caches to free an entry.
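
The lookup steps and the overflow behaviour can be sketched together. This model is intentionally simplified: `SnoopFilter` is a hypothetical class, it is fully associative rather than set-associative, and it evicts the oldest entry when full, which is an assumed replacement policy.

```python
class SnoopFilter:
    """Toy directory: line address -> set of masters that may hold the line."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}   # dicts keep insertion order, so the first key is oldest

    def record_fill(self, addr, master):
        """Track a line fill. Returns [(victim_addr, holders)] back-invalidated."""
        back_invalidated = []
        if addr not in self.entries and len(self.entries) >= self.capacity:
            victim = next(iter(self.entries))
            back_invalidated.append((victim, self.entries.pop(victim)))
        self.entries.setdefault(addr, set()).add(master)
        return back_invalidated

    def record_evict(self, addr, master):
        """A master reported dropping the line (Evict / WriteBack)."""
        holders = self.entries.get(addr)
        if holders:
            holders.discard(master)
            if not holders:
                del self.entries[addr]

    def snoop_targets(self, addr, requester):
        """Masters to snoop for a request; an empty set means go to memory."""
        return self.entries.get(addr, set()) - {requester}
```

With a capacity of two, a third distinct fill forces the oldest line to be back-invalidated from whichever masters held it, which is the cost an undersized filter imposes.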

Filter as a cache

A too-small snoop filter becomes a bottleneck: when a new line needs an entry and the filter is full, a still-valid cache line is force-invalidated just to free that entry, and the CPU that held it takes a miss later.

Arm's CCI-500/550 filters are typically sized 1.2–1.5× the aggregate cache capacity across the clusters they serve. CHI's Home Node filters (next deck) go further — they're often oversized and fully inclusive.

ACE Domains & Sharing Scope

A CPU accessing a stack variable marks it Inner Shareable — only its own cluster needs to snoop.
A kernel data structure shared across clusters is marked Outer Shareable — all clusters + all ACE-Lite masters snoop.
DMA-mapped network buffer: typically Outer Shareable if the CPU and the NIC genuinely share it; Non-shareable if the driver manages cache maintenance manually.
This is why Armv8 OS kernels tune page table attributes carefully: choosing the wrong shareability domain can cost on the order of 10% in performance.

ACE5 — AMBA 5 Refresh

ACE5 (2017) brought AXI5's improvements to the coherent profile:
- Atomic transactions — same offload as AXI5, but can participate in coherency (other caches get snooped accordingly).
- Cache stashing: a master can hint that a cache line be installed directly into a target cache, where it ends up UC or SC.
- Cache Maintenance Operations (CMO) — system-wide cache clean/invalidate over the bus rather than CPU-executed loops.
- MTE support — tag bits on every coherent transaction.
- RME hooks — AxNSE for 4-way security states (Root/Realm/Secure/Non-Secure).

ACE5-Lite

The I/O-coherent profile got AMBA 5'd as well — same changes, but without the master-side snoop channels. Every modern NPU or GPU that wants to see CPU cache writes uses ACE5-Lite.

Common question: "Why both ACE5 and CHI?" Arm keeps ACE5 for master-side coherency at cluster scale, where its signal-level wiring is cheapest. CHI is for system-level fabric.

ACE's Scaling Limits

ACE's basic model is snoop-everyone. At 2 clusters → 1 peer to snoop. At 8 clusters → 7 peers per miss.
Snoop filters amortise but don't eliminate the fan-out.
The AC / CR / CD channels are signal-level buses — every master port has them physically. For 32+ masters, the wire count becomes unbearable on a reticle-sized die.
Beyond ~8 coherent masters, the protocol's point-to-point wiring assumption runs out.
Solution: move to a packet-based protocol where any number of nodes can share the same physical transport (mesh NoC). That protocol is CHI (next deck).
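
The fan-out arithmetic above is simple enough to write down. In this sketch `snoops_per_miss` and its `filtered_fraction` parameter are hypothetical; the parameter stands for the share of snoops a filter removes and is an input you would have to measure, not a property of the protocol.

```python
def snoops_per_miss(coherent_masters, filtered_fraction=0.0):
    """Average snoops sent per miss under a snoop-everyone model.

    filtered_fraction is the (assumed) share of snoops a filter eliminates.
    """
    peers = coherent_masters - 1
    return peers * (1.0 - filtered_fraction)
```

`snoops_per_miss(2)` gives 1.0 and `snoops_per_miss(8)` gives 7.0, matching the figures in the text; a filter lowers the average but each master port still needs the snoop wiring.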

The 2015 crisis

Arm's early server ambitions (Cortex-A57 16-core systems via CCI-550 + CCN-504) ran into ACE's scaling wall. The CCN-504 Cache Coherent Network interconnect used an internal CHI-like ring, but the customer-visible ACE ports were still signal-level. For Neoverse-class designs Arm pushed customers to CHI at the master interface too.

Today ACE is the right choice at cluster scale (≤8 CPUs); CHI is the right choice at system scale (32–128+ coherent nodes).

ACE in the Real World

ACE systems that shipped big

Samsung Exynos 5 Octa (2013) — Cortex-A15 + A7 big.LITTLE over CCI-400.
Qualcomm Snapdragon 820 (2016) — Kryo clusters over CCI-550.
MediaTek Helio X10/X20 (2015–16): octa- and deca-core designs via CCI-500.
Apple A6 → A11 — bespoke interconnects with ACE interfaces (though Apple runs custom cores).
Amlogic, Allwinner, Rockchip: countless set-top / automotive SoCs.

ACE-Lite on the periphery

Mali GPUs (T6xx, T7xx, G-series) — ACE-Lite for coherent buffer sharing with the CPU.
Video decoders & ISPs — ACE-Lite so decoded frames are visible to CPU caches.
Cortex-R coherent peers in R82 (2020) — real-time cores joining an A-profile coherency domain.
Arm SMMU MMU-600/MMU-700 — participates in DVM on behalf of I/O devices.

Minimal ACE-Lite Master — Almost Free

The cheapest way to be "coherent"

An ACE-Lite master has no snoop channels. Physically it's just an AXI4 master with a few extra signals (AxDOMAIN, AxSNOOP, AxBAR).
If all you need is I/O coherency ("CPU cache sees my writes"), you can bolt ACE-Lite onto an existing AXI master by driving the new attributes to sensible constants — zero extra logic.
Interconnect (CCI) sees those constants and snoops the CPU on behalf of the master.

This is why almost every modern DMA engine gained "coherency" in one RTL afternoon — the work was on the interconnect side, not on the master.

// Thin ACE-Lite wrapper around an AXI4 master.
// No snoop channels exist on the master side.
module acel_wrap #(parameter ID_W = 4) (
  // existing AXI4 port (connect to your master IP)
  output logic [ID_W-1:0] ARID,
  output logic [31:0]     ARADDR,
  output logic [7:0]      ARLEN,
  output logic [2:0]      ARSIZE,
  output logic [1:0]      ARBURST,
  output logic            ARVALID,
  input  logic            ARREADY,
  // ... (the rest of AR/R/AW/W/B as usual)

  // ACE-Lite extra attributes — pure combinational tie-offs
  output logic [1:0]      ARDOMAIN,
  output logic [3:0]      ARSNOOP,
  output logic [1:0]      ARBAR,
  output logic [1:0]      AWDOMAIN,
  output logic [2:0]      AWSNOOP,
  output logic [1:0]      AWBAR
);
  // I/O-coherent DMA targeting shared memory:
  //   Outer Shareable, ReadOnce / WriteUnique,
  //   No barriers.
  assign ARDOMAIN = 2'b10;    // Outer Shareable
  assign ARSNOOP  = 4'b0000;  // ReadOnce
  assign ARBAR    = 2'b00;    // No barrier

  assign AWDOMAIN = 2'b10;    // Outer Shareable
  assign AWSNOOP  = 3'b000;   // WriteUnique (3'b001 would be WriteLineUnique)
  assign AWBAR    = 2'b00;    // No barrier
endmodule

Six assigns. That's it. Total new gates for coherent DMA = 0 — the CCI / CMN does all the work.

Interview-Ready Takeaways

"What are the five ACE cache states?" → UC, UD, SC, SD, I. They map to E, M, S, O, I in MOESI.
"What's the difference between ReadShared and ReadUnique?" → ReadShared accepts shared copies elsewhere; ReadUnique invalidates all peers so the requester can modify.
"What are the three snoop channels?" → AC (snoop address, interconnect → cache), CR (snoop response), CD (snoop data).
"Why ACE-Lite for GPUs?" → GPUs are producers/consumers of memory, not holders of cache state. ACE-Lite lets them enter the coherency domain one-way, without exposing unused snoop channels.

"What is DVM?" → Distributed Virtual Memory — TLB/I-cache maintenance messages piggybacked on the ACE snoop channels so SMP software doesn't need cross-CPU IPIs for every TLBI.
"Why does a snoop filter matter?" → It turns a broadcast snoop into a targeted one. With 8 masters, a broadcast sends 7 snoops per miss, while a filter sends only as many as there are actual holders, often none.
"Why doesn't ACE scale to 32 coherent masters?" → Point-to-point signal-level wiring; the AC/CR/CD channels must physically reach every master port. Solved by moving to packet-based CHI.
"What did ACE5 add?" → Atomics, cache stashing, CMO, MTE tags, RME hooks.

References

Arm Ltd. — AMBA AXI and ACE Protocol Specification (IHI 0022) — contains the full ACE / ACE-Lite / ACE5 definitions
Arm Ltd. — CoreLink CCI-400 Cache Coherent Interconnect Technical Reference Manual
Arm Ltd. — CCI-500 / CCI-550 Technical Reference Manuals — snoop filter design & sizing
Arm Ltd. — Cortex-A15 MPCore TRM — reference ACE master definition
Hennessy & Patterson — Computer Architecture: A Quantitative Approach, 6th ed. — chapter on multiprocessor cache coherence (MOESI variants)
Sorin, D., Hill, M., Wood, D. — A Primer on Memory Consistency and Cache Coherence (Synthesis Lectures, 2011) — the canonical textbook on directory vs snooping protocols
Sridharan, S. et al. — ACE Protocol and Coherency — Arm Tech Summit presentations (2013–2016)
Martonosi, M. et al. — papers on DVM scalability in Arm SoCs (ASPLOS, MICRO)
Wikipedia — "MOESI protocol", "Cache coherence" — well-sourced cross-references

Educational use.