Bringing Cache Coherency to AXI · Five Cache States, Three Snoop Channels
AC · CR · CD · UC / UD / SC / SD / I · ReadShared · ReadUnique · DVM · CCI-400/500/550
02
Why Coherency? Why AMBA?
By 2010, multi-core Arm SoCs (Cortex-A9 quad) had reached the point where software-managed cache maintenance was no longer acceptable.
The Linux kernel needed hardware cache coherency for SMP scheduling and page-table updates to perform acceptably.
big.LITTLE (Cortex-A7 + A15, 2011) made it unavoidable — the OS had to migrate threads between clusters with live cache state.
Arm needed a bus protocol that could carry coherency primitives. Bolting them onto AXI gave ACE — AXI Coherency Extensions — in AMBA 4 (2011).
ACE in one sentence
ACE adds three snoop channels to AXI and extends the transaction type space so a requesting cache can signal what it wants (Shared, Unique, etc.) and peer caches can report back what they have.
Arm publishes ACE alongside AXI in a single specification (IHI 0022). Every AXI signal still exists; ACE just layers on top.
03
MOESI — Refresher
The five classical cache-coherency states, as in the MOESI protocol popularised by AMD and SPARC systems:
Modified — dirty, exclusive, must write back
Owned — dirty, shared (this cache responsible for write-back)
Exclusive — clean, only copy
Shared — clean, may be copies elsewhere
Invalid — not valid
ACE uses an equivalent five-state model, renamed to fit AXI's semantics: UniqueClean / UniqueDirty / SharedClean / SharedDirty / Invalid.
ACE ↔ MOESI mapping
UniqueClean (UC)  ↔  E
UniqueDirty (UD)  ↔  M
SharedClean (SC)  ↔  S
SharedDirty (SD)  ↔  O
Invalid (I)       ↔  I
The MOESI "O" (Owned) state — the one responsible for supplying clean copies to peers and eventually writing back — is SharedDirty in ACE-speak.
04
The Five ACE Cache States
UniqueClean
This cache holds the only copy. Memory is up to date. Equivalent to E.
UniqueDirty
Only copy; memory is stale. Must write back before eviction. Equivalent to M.
SharedClean
Possibly shared with peers. Memory may be up-to-date or another peer may own a dirty copy. Equivalent to S.
SharedDirty
Shared with peers; this cache owns the dirty copy and is responsible for write-back. Equivalent to O.
Invalid
Line not present / dropped.
Key invariants
At most one cache can be Unique (UC or UD) for a given line.
At most one cache can be SharedDirty for a given line.
Multiple caches can be SharedClean simultaneously.
What's "Dirty"?
A cache line is Dirty if its data differs from the value in underlying memory. Responsibility for the eventual write-back always lies with exactly one cache (UD or SD).
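The states and invariants above can be captured in a few lines. This is an illustrative sketch, not from the ACE specification: the enum, function, and names are mine.

```python
# Illustrative sketch of the five ACE line states and the per-line
# invariants the protocol maintains (names and checker are mine).
from enum import Enum

class State(Enum):
    UC = "UniqueClean"   # only copy, memory up to date        (MOESI E)
    UD = "UniqueDirty"   # only copy, memory stale             (MOESI M)
    SC = "SharedClean"   # possibly shared, no write-back duty (MOESI S)
    SD = "SharedDirty"   # shared, owns the dirty copy         (MOESI O)
    I  = "Invalid"       # not present                         (MOESI I)

def invariants_hold(states):
    """Check the ACE invariants for one line's state across every cache."""
    unique = sum(s in (State.UC, State.UD) for s in states)
    dirty_owners = sum(s is State.SD for s in states)
    holders = sum(s is not State.I for s in states)
    # At most one Unique holder; if anyone is Unique, nobody else has a copy.
    if unique > 1 or (unique == 1 and holders > 1):
        return False
    # At most one SharedDirty owner.
    return dirty_owners <= 1

assert invariants_hold([State.UD, State.I, State.I])       # one Unique holder: OK
assert invariants_hold([State.SD, State.SC, State.SC])     # one dirty owner, many SC: OK
assert not invariants_hold([State.UC, State.SC, State.I])  # Unique alongside a copy: bad
assert not invariants_hold([State.SD, State.SD, State.I])  # two dirty owners: bad
```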
05
ACE Channels — AXI Plus Three
On top of the five AXI channels (AR, R, AW, W, B), ACE adds three snoop channels:
AC — Snoop Address (interconnect → cache)
CR — Snoop Response (cache → interconnect)
CD — Snoop Data (cache → interconnect, optional)
When a new request arrives at the interconnect, the Home logic broadcasts an AC snoop to all peer caches; each responds with CR (state + hit/miss), and if it has modified data, drives CD.
Plus a DVM logical channel (Distributed Virtual Memory) — piggybacks on the snoop channels for TLB invalidation, I-cache maintenance, and barrier messages.
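The AC/CR/CD sequence above can be sketched as a broadcast loop. This is a toy model: the `Peer` class, method names, and encodings are invented for illustration, keeping only the channel roles from the text.

```python
# Hypothetical sketch of the interconnect's snoop loop over AC/CR/CD.
# Peer objects and method names are invented; only the channel roles
# (AC out, CR back, CD optional) follow the slide.
def snoop_all(peers, addr, snoop_type):
    """Broadcast one snoop on AC; gather CR responses and any CD data."""
    data = None
    for peer in peers:
        crresp, cd_data = peer.snoop(addr, snoop_type)  # AC out, CR (+CD) back
        if crresp & 1:          # CRRESP[0] DataTransfer: peer drives CD
            data = cd_data
    return data

class Peer:
    def __init__(self, lines):  # lines: {addr: data} this peer caches
        self.lines = lines
    def snoop(self, addr, snoop_type):
        if addr in self.lines:
            return 0b00001, self.lines[addr]  # hit: DataTransfer set
        return 0b00000, None                  # miss: "I have nothing"

peers = [Peer({}), Peer({0x80: b"hot"})]
assert snoop_all(peers, 0x80, "ReadShared") == b"hot"  # served from a peer cache
assert snoop_all(peers, 0x40, "ReadShared") is None    # all miss: go to memory
```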
06
ACE Transaction Types — Read Side
Transaction         When used                                          Finishing state
ReadOnce            One-shot snoop read; don't cache                   (not cached)
ReadClean           Line fill expecting clean data only                UC or SC
ReadNotSharedDirty  Line fill, willing to accept any state except SD   UC / UD / SC
ReadShared          Read expecting to share the line                   UC / UD / SC / SD
ReadUnique          Read intending to modify; invalidates all peers    UC or UD
CleanUnique         Already SC/SD; upgrade to unique before writing    UC
MakeUnique          Overwrite entire line; no data transfer needed     UD
Why so many?
Each type tells the coherency logic exactly what ownership and data transfer are needed — lets the interconnect do the minimum work. A cache-line upgrade (SC → UC) needs no data movement at all, only invalidation messages.
CleanUnique vs MakeUnique: CleanUnique keeps the current data valid (reading before writing); MakeUnique discards current data (full-line overwrite, like memset). MakeUnique saves bus bandwidth on whole-line writes.
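The selection logic above can be condensed into a small decision function. The mapping is my own summary of the table, with invented state and intent strings:

```python
# Sketch (my own condensation of the table above): pick the read-side
# transaction from what the requester holds and what it intends to do.
def read_transaction(current_state, intent):
    """current_state: 'I', 'SC', or 'SD'; intent: 'read', 'write', 'overwrite'."""
    if intent == "overwrite":
        return "MakeUnique"        # full-line write: no data transfer needed
    if intent == "write":
        if current_state in ("SC", "SD"):
            return "CleanUnique"   # upgrade in place, keep current data
        return "ReadUnique"        # fetch the line and invalidate peers
    return "ReadShared"            # plain read; shared copies are fine

assert read_transaction("I", "read") == "ReadShared"
assert read_transaction("SC", "write") == "CleanUnique"  # upgrade, no data movement
assert read_transaction("I", "write") == "ReadUnique"
assert read_transaction("I", "overwrite") == "MakeUnique"
```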
07
ACE Transaction Types — Write Side
Transaction      Purpose
WriteUnique      Non-cached write; snoops others to invalidate first
WriteLineUnique  Full-line non-cached write
WriteBack        Evict a dirty line (UD or SD) to memory
WriteClean       Flush a dirty line but keep it cached (UD → UC)
Evict            Inform coherency that a clean line was dropped
WriteEvict       Write-back + invalidate in one step
Why "Evict" (clean drop)?
A snoop filter inside a coherent interconnect needs to know which caches currently hold which lines — if a CPU silently drops a clean line, the filter's tracking diverges. ACE requires explicit Evict for clean drops so the filter can reclaim that entry.
WriteEvict is the efficient case: you want to write back and drop, in one transaction. Used at end-of-workload or on LRU eviction of a modified line.
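The eviction choices above reduce to a small rule. This is my own condensation of the table, under the simplifying assumption that the line is leaving unless `keep_cached` is set:

```python
# Sketch of the write-side decision on eviction (my own condensation of
# the table above): a line never leaves silently, so the snoop filter
# stays accurate.
def evict_transaction(state, keep_cached=False):
    if state in ("UC", "SC"):
        return "Evict"         # clean drop: still tell the interconnect
    if keep_cached:
        return "WriteClean"    # flush dirty data, line stays (UD -> UC)
    return "WriteBack"         # dirty line leaves: write back to memory

assert evict_transaction("SC") == "Evict"
assert evict_transaction("UD") == "WriteBack"
assert evict_transaction("UD", keep_cached=True) == "WriteClean"
```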
08
Snoop Transactions — from the Interconnect
When a ReadShared / ReadUnique / CleanUnique arrives at the interconnect, it broadcasts a snoop to every peer cache on their AC channel.
Snoop types mirror the request semantics:
ReadOnce / ReadClean / ReadNotSharedDirty / ReadShared / ReadUnique — peer must supply data if present, and transition its state accordingly.
CleanInvalid — peer must write back dirty data, then invalidate.
MakeInvalid — peer drops the line without writing back (used when the requester intends to overwrite the whole line).
The peer responds on CR with a 5-bit CRRESP[4:0] field; its bits encode data-transfer, error, pass-dirty, is-shared, and was-unique.
CRRESP encoding
CRRESP[0] DataTransfer — CD active
CRRESP[1] Error — snoop failed
CRRESP[2] PassDirty — responder owns dirty
CRRESP[3] IsShared — responder keeps shared
CRRESP[4] WasUnique — responder was unique
A peer that's already Invalid for a line can respond with CRRESP=0 — "I have nothing." This is the common case; most snoops miss most peers.
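The bit positions listed above pack into a byte trivially; a small encoder/decoder makes the encodings concrete (helper names are mine, the bit layout is from the list above):

```python
# Sketch: pack/unpack the CRRESP[4:0] bits listed above (helpers are mine).
BITS = {"DataTransfer": 0, "Error": 1, "PassDirty": 2, "IsShared": 3, "WasUnique": 4}

def encode(**flags):
    """Build a CRRESP value from named flags, e.g. encode(DataTransfer=True)."""
    return sum(1 << BITS[name] for name, on in flags.items() if on)

def decode(crresp):
    """Expand a CRRESP value back into named booleans."""
    return {name: bool(crresp >> bit & 1) for name, bit in BITS.items()}

# A UD peer hit by ReadUnique: supplies data, hands over dirty
# responsibility, drops its copy (IsShared clear), was the unique holder.
resp = encode(DataTransfer=True, PassDirty=True, WasUnique=True)
assert resp == 0b10101
assert decode(0)["DataTransfer"] is False  # CRRESP=0: "I have nothing"
```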
09
A Coherent Read — Full Flow
Example: CPU0 issues a ReadShared that hits in CPU1's cache. CPU1 was UniqueDirty; after the snoop it becomes SharedDirty (retaining ownership of the dirty copy) and supplies the data on CD. CPU0 becomes SharedClean: it has the data, but responsibility for write-back stays with CPU1. No DRAM access was required.
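The flow above can be walked through in a toy model. The function and state strings are mine, condensing the ReadShared transitions from the earlier tables:

```python
# Toy walk-through of one ReadShared (state names only, no timing);
# the function is my own condensation of the transition rules.
def read_shared(requester, peers):
    """Return (new_states, served_by_peer) after one ReadShared."""
    out = dict(peers)
    hit = False
    for name, state in peers.items():
        if state == "UD":
            out[name] = "SD"   # keeps the dirty copy and write-back duty
            hit = True
        elif state == "UC":
            out[name] = "SC"
            hit = True
        elif state in ("SC", "SD"):
            hit = True
    out[requester] = "SC"      # data arrives from a peer (or memory on miss)
    return out, hit

after, from_peer = read_shared("CPU0", {"CPU1": "UD"})
assert after == {"CPU1": "SD", "CPU0": "SC"}
assert from_peer               # data came over CD: no DRAM read needed
```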
10
AxDOMAIN & AxSNOOP
AxDOMAIN[1:0]
Controls the scope of snooping:
00 Non-shareable — no snoop needed
01 Inner Shareable — snoop within the Inner domain (typically a cluster)
10 Outer Shareable — snoop across the outer domain (multiple clusters, system cache)
11 System — outside all shareability domains; not snooped, goes straight to memory
AxSNOOP[3:0]
Encodes the specific transaction type (ReadOnce / ReadShared / CleanUnique / etc.). Combined with AxDOMAIN, the interconnect knows exactly who to snoop and how.
Mapping from MMU to ACE
Armv8 page tables specify Shareability (Inner / Outer / None) per region. The CPU combines that with the memory type to produce AxDOMAIN + AxCACHE; the ACE interconnect then enforces the coherency domain.
Performance hint: marking a big DMA buffer Non-shareable saves every snoop cycle that would otherwise be wasted. Many OS drivers map DMA memory as Non-shareable for this reason.
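A minimal decode of the AxDOMAIN[1:0] list above, mapping the two-bit encoding to who gets snooped (the helper name is mine; note that System transactions bypass snooping in ACE):

```python
# Sketch: decode AxDOMAIN[1:0] into a snoop scope, per the list above.
# Helper name is mine; "System" transactions are not snooped.
def snoop_scope(axdomain):
    return {
        0b00: "none",    # Non-shareable: no snoop needed
        0b01: "inner",   # snoop within the cluster's inner domain
        0b10: "outer",   # snoop across clusters / system cache
        0b11: "none",    # System: beyond all domains, straight to memory
    }[axdomain]

assert snoop_scope(0b00) == "none"   # e.g. a Non-shareable DMA buffer
assert snoop_scope(0b01) == "inner"  # e.g. a stack variable
assert snoop_scope(0b10) == "outer"  # e.g. a cross-cluster kernel structure
```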
11
ACE-Lite — I/O Coherency
Most peripherals aren't caches — they're producers/consumers of memory. They don't need to hold cache state, but they benefit enormously from the CPU's caches being coherent with their traffic.
ACE-Lite is a restricted ACE profile for such agents. Signals from the master side:
AXI + AxDOMAIN + AxSNOOP (so the interconnect can snoop CPU caches).
No AC / CR / CD channels on the master (it has no caches to snoop).
The master reads from / writes to memory, and the interconnect automatically snoops CPU caches to return fresh data or invalidate stale copies.
This is the coherent DMA / GPU / NIC / NPU interface.
One-way coherency
An ACE-Lite master is coherency-transparent: the CPU stays coherent with it, but the master itself has no caches. No peer snoops the master — it would find nothing.
Cortex-A57 + Mali GPU over CCI-500 is the canonical example: the GPU uses ACE-Lite, so memory writes from CPUs are visible to the GPU without software cache maintenance.
12
DVM — Distributed Virtual Memory
Cache coherency is only half of SMP correctness — the other half is TLB coherency.
When one CPU invalidates a page-table entry (e.g. page removed, permission changed), every other CPU's TLBs must drop the corresponding entries before the OS can reuse the page.
DVM messages travel on the ACE snoop channels (AC, with a special SNOOP type) and carry:
TLB Invalidate (TLBI) — by VA, by ASID, by all
Branch Predictor Invalidate (BPI)
Instruction-cache Invalidate (ICI)
Sync — pipelined barrier
Every receiver must acknowledge; a DVM Sync waits for all acks before proceeding.
DVM + TLBI IS instructions
When software executes TLBI IS, ... on Armv8, the CPU issues a DVM TLBI to the interconnect, which broadcasts it to every Inner-Shareable participant. The subsequent DSB ISH waits on the DVM Sync acks.
DVM is often the bottleneck on workloads with heavy fork/mmap churn. Arm server chips put serious engineering into DVM fan-out so it doesn't serialise.
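The broadcast-then-sync pattern described above can be modelled in a few lines. This is a toy model with invented class and method names; only the shape (fan-out, then a Sync that gathers every ack) follows the text:

```python
# Toy model (invented names) of DVM broadcast + Sync: the TLBI fans out
# to every core, and the Sync completes only after all acks arrive.
class Core:
    def __init__(self):
        self.tlb = {}        # {va: pa}, this core's TLB
        self.pending = 0     # DVM ops received but not yet acked
    def recv_tlbi(self, va):
        self.tlb.pop(va, None)   # drop the translation
        self.pending += 1
    def ack(self):
        self.pending -= 1
        return True

def tlbi_is(cores, va):
    for c in cores:              # DVM TLBI broadcast on AC
        c.recv_tlbi(va)
    # DVM Sync (what DSB ISH waits on): gather every ack before returning.
    assert all(c.ack() for c in cores)

cores = [Core(), Core()]
cores[1].tlb[0x1000] = 0x8000
tlbi_is(cores, 0x1000)
assert all(0x1000 not in c.tlb for c in cores)  # stale entry gone everywhere
assert all(c.pending == 0 for c in cores)       # every ack collected
```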
13
CCI-400 / CCI-500 / CCI-550
CCI      Year  ACE masters   ACE-Lite slaves
CCI-400  2011  2 (clusters)  3
CCI-500  2014  4             7, plus snoop filter
CCI-550  2015  6             7, larger snoop filter
CCI = Cache Coherent Interconnect. Arm's off-the-shelf IP used in virtually every big.LITTLE and LITTLE-only SoC between 2011 and 2016.
Why a snoop filter mattered
CCI-400 broadcast every snoop to every ACE master. With 2 clusters × 4 CPUs, fine. With 6–8 clusters, every unrelated miss generated up to 7 snoops, overwhelming the ACE links.
CCI-500 added a snoop filter — an inclusive directory of which caches hold which lines — so only relevant snoops are broadcast. A 95%+ snoop-reduction rate was typical.
CCI-550 was the last generation of CCI. Beyond 8 coherent masters, the broadcast architecture became uneconomical — which motivated the CHI transition.
14
Snoop Filters — How They Work
The snoop filter is a directory: for each cache line currently cached anywhere in the system, track which masters hold it.
Typically implemented as a set-associative tag structure, inclusive of the caches it tracks.
On a coherent request:
Lookup the address in the filter.
If no master holds it → skip snoop entirely, go direct to memory.
If one or more masters hold it → snoop only those.
Filter capacity must be ≥ the sum of tracked cache capacities; otherwise, on overflow, the filter must back-invalidate lines out of the tracked caches to free entries.
Filter as a cache
A too-small snoop filter becomes a bottleneck: on overflow, a still-useful cache line is force-invalidated just to free a filter entry, causing the CPU to miss on it later.
Arm's CCI-500/550 filters are typically sized 1.2–1.5× the aggregate cache capacity across the clusters they serve. CHI's Home Node filters (next deck) go further — they're often oversized and fully inclusive.
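The lookup, tracking, and back-invalidation described above fit in a small directory sketch. The structure is deliberately simplified (a plain dict instead of a set-associative tag array; capacity counted in lines), and the class is mine:

```python
# Directory sketch of a snoop filter (simplified: plain dict instead of a
# set-associative tag array; class and method names are mine).
class SnoopFilter:
    def __init__(self, capacity):
        self.capacity = capacity
        self.dir = {}                      # line addr -> set of holders

    def request(self, addr):
        """Who to snoop for this address: nobody, or exactly the holders."""
        return self.dir.get(addr, set())

    def track(self, addr, master):
        """Record a new holder; on overflow, back-invalidate a victim line."""
        if addr not in self.dir and len(self.dir) >= self.capacity:
            victim = next(iter(self.dir))  # holders must drop this line
            del self.dir[victim]
        self.dir.setdefault(addr, set()).add(master)

    def evict(self, addr, master):         # why ACE needs explicit Evict
        holders = self.dir.get(addr, set())
        holders.discard(master)
        if not holders:
            self.dir.pop(addr, None)

f = SnoopFilter(capacity=2)
f.track(0x100, "CPU0")
assert f.request(0x100) == {"CPU0"}  # snoop only the holder
assert f.request(0x200) == set()     # filter miss: go straight to memory
f.evict(0x100, "CPU0")
assert f.request(0x100) == set()     # explicit Evict kept the directory exact
```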
15
ACE Domains & Sharing Scope
A CPU accessing a stack variable marks it Inner Shareable — only its own cluster needs to snoop.
A kernel data structure shared across clusters is marked Outer Shareable — every cluster in the outer domain is snooped (ACE-Lite masters hold no cache state, so they are never snooped themselves).
DMA-mapped network buffer: typically Outer Shareable if the CPU and the NIC genuinely share it; Non-shareable if the driver manages cache maintenance manually.
This is why Armv8 OS kernels tune page table attributes carefully — the wrong shareability domain can be a 10% performance bug.
16
ACE5 — AMBA 5 Refresh
ACE5 (2017) brought AXI5's improvements to the coherent profile:
Atomic transactions — same offload as AXI5, but can participate in coherency (other caches get snooped accordingly).
Cache stashing — an ACE master can hint that a cache line be installed into a target cache, arriving in a given state (UC/SC).
Cache Maintenance Operations (CMO) — system-wide cache clean/invalidate over the bus rather than CPU-executed loops.
MTE support — tag bits on every coherent transaction.
RME hooks — AxNSE for 4-way security states (Root/Realm/Secure/Non-Secure).
ACE5-Lite
The I/O-coherent profile got AMBA 5'd as well — same changes, but without the master-side snoop channels. Every modern NPU or GPU that wants to see CPU cache writes uses ACE5-Lite.
Common question: "Why both ACE5 and CHI?" — Arm keeps ACE5 for master-side coherency at cluster scale, where its signal-level wire is cheapest. CHI is for system-level fabric.
17
ACE's Scaling Limits
ACE's basic model is snoop-everyone. At 2 clusters → 1 peer to snoop. At 8 clusters → 7 peers per miss.
Snoop filters amortise but don't eliminate the fan-out.
The AC / CR / CD channels are signal-level buses — every master port has them physically. For 32+ masters, the wire count becomes unbearable on a reticle-sized die.
Solution: move to a packet-based protocol where any number of nodes can share the same physical transport (mesh NoC). That protocol is CHI (next deck).
The 2015 crisis
Arm's early server ambitions (Cortex-A57 16-core systems via CCI-550 + CCN-504) ran into ACE's scaling wall. The CCN-504 Cache Coherent Network interconnect used an internal CHI-like ring, but the customer-visible ACE ports were still signal-level. For Neoverse-class designs Arm pushed customers to CHI at the master interface too.
Today ACE is the right choice at cluster scale (≤8 CPUs); CHI is the right choice at system scale (32–128+ coherent nodes).
Mali GPUs (T6xx, T7xx, G-series) — ACE-Lite for coherent buffer sharing with the CPU.
Video decoders & ISPs — ACE-Lite so decoded frames are visible to CPU caches.
Cortex-R coherent peers in R82 (2020) — real-time cores joining an A-profile coherency domain.
Arm SMMU MMU-600/MMU-700 — participates in DVM on behalf of I/O devices.
19
Minimal ACE-Lite Master — Almost Free
The cheapest way to be "coherent"
An ACE-Lite master has no snoop channels. Physically it's just an AXI4 master with a few extra signals (AxDOMAIN, AxSNOOP, AxBAR).
If all you need is I/O coherency ("CPU cache sees my writes"), you can bolt ACE-Lite onto an existing AXI master by driving the new attributes to sensible constants — zero extra logic.
Interconnect (CCI) sees those constants and snoops the CPU on behalf of the master.
This is why almost every modern DMA engine gained "coherency" in one RTL afternoon — the work was on the interconnect side, not on the master.
Six constant assigns (ARDOMAIN, AWDOMAIN, ARSNOOP, AWSNOOP, ARBAR, AWBAR). That's it. Total new gates for coherent DMA = 0 — the CCI / CMN does all the work.
20
Interview-Ready Takeaways
"What are the five ACE cache states?" → UC, UD, SC, SD, I. They map to E, M, S, O, I in MOESI.
"What's the difference between ReadShared and ReadUnique?" → ReadShared accepts shared copies elsewhere; ReadUnique invalidates all peers so the requester can modify.
"What are the three snoop channels?" → AC (snoop address, interconnect → cache), CR (snoop response), CD (snoop data).
"Why ACE-Lite for GPUs?" → GPUs are producers/consumers of memory, not holders of cache state. ACE-Lite lets them enter the coherency domain one-way, without exposing unused snoop channels.
"What is DVM?" → Distributed Virtual Memory — TLB/I-cache maintenance messages piggybacked on the ACE snoop channels so SMP software doesn't need cross-CPU IPIs for every TLBI.
"Why does a snoop filter matter?" → It turns a broadcast snoop into a targeted one. At 8+ masters, broadcasting is 10× more expensive than filtering.
"Why doesn't ACE scale to 32 coherent masters?" → Point-to-point signal-level wiring; the AC/CR/CD channels must physically reach every master port. Solved by moving to packet-based CHI.
Arm Ltd. — AMBA AXI and ACE Protocol Specification (IHI 0022) — contains the full ACE / ACE-Lite / ACE5 definitions
Arm Ltd. — CoreLink CCI-400 Cache Coherent Interconnect Technical Reference Manual
Arm Ltd. — CCI-500 / CCI-550 Technical Reference Manuals — snoop filter design & sizing
Arm Ltd. — Cortex-A15 MPCore TRM — reference ACE master definition
Hennessy & Patterson — Computer Architecture: A Quantitative Approach, 6th ed. — chapter on multiprocessor cache coherence (MOESI variants)
Sorin, D., Hill, M., Wood, D. — A Primer on Memory Consistency and Cache Coherence (Synthesis Lectures, 2011) — the canonical textbook on directory vs snooping protocols
Sridharan, S. et al. — ACE Protocol and Coherency — Arm Tech Summit presentations (2013–2016)
Martonosi, M. et al. — papers on DVM scalability in Arm SoCs (ASPLOS, MICRO)
Wikipedia — "MOESI protocol", "Cache coherence" — well-sourced cross-references
Presentation built with Reveal.js 4.6 · Playfair Display + DM Sans + JetBrains Mono
Educational use.