ARM AMBA · PRESENTATION 04

ACE & ACE-Lite

Bringing Cache Coherency to AXI · Five Cache States, Three Snoop Channels
AC · CR · CD · UC / UD / SC / SD / I · ReadShared · ReadUnique · DVM · CCI-400/500/550

Why Coherency? Why AMBA?

  • By 2010, multi-core Arm SoCs (Cortex-A9 quad) had reached the point where software-managed cache maintenance was no longer acceptable.
  • The Linux kernel needed hardware cache coherency for SMP scheduling and page-table updates to work reasonably.
  • big.LITTLE (Cortex-A7 + A15, 2011) made it unavoidable — the OS had to migrate threads between clusters with live cache state.
  • Arm needed a bus protocol that could carry coherency primitives. Bolting them onto AXI gave ACE (AXI Coherency Extensions), published in 2011 as part of AMBA 4.

ACE in one sentence

ACE adds three snoop channels to AXI and extends the transaction type space so a requesting cache can signal what it wants (Shared, Unique, etc.) and peer caches can report back what they have.

Arm publishes ACE alongside AXI in a single specification (IHI 0022). Every AXI signal still exists; ACE just layers on top.

MOESI — Refresher

  • Classical cache-coherency states, named after the five-state protocol popularised by AMD / SPARC:
    • Modified — dirty, exclusive, must write back
    • Owned — dirty, shared (this cache responsible for write-back)
    • Exclusive — clean, only copy
    • Shared — clean, may be copies elsewhere
    • Invalid — not valid
  • ACE uses a variant with five equivalent states but renamed to fit AXI's semantics — UniqueClean / UniqueDirty / SharedClean / SharedDirty / Invalid.

ACE ↔ MOESI mapping

ACE state          MOESI
UniqueClean (UC)   E
UniqueDirty (UD)   M
SharedClean (SC)   S
SharedDirty (SD)   O
Invalid (I)        I

The MOESI "O" (Owned) state, the one responsible for supplying the line to peers and eventually writing it back, is SharedDirty in ACE-speak.
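The mapping is easier to remember once each ACE name is read as two independent attributes, Unique/Shared and Clean/Dirty. A minimal sketch of that decomposition follows; the helper name `ace_state` and its boolean arguments are illustrative assumptions, not anything defined by the specification.

```python
# Mapping copied from the table above.
ACE_TO_MOESI = {"UC": "E", "UD": "M", "SC": "S", "SD": "O", "I": "I"}


def ace_state(valid: bool, unique: bool = False, dirty: bool = False) -> str:
    """Build the two-letter ACE state name from its attributes (hypothetical helper)."""
    if not valid:
        return "I"
    return ("U" if unique else "S") + ("D" if dirty else "C")


# A shared line whose holder owes the write-back is SharedDirty, i.e. MOESI "O".
owner = ace_state(valid=True, unique=False, dirty=True)
moesi = ACE_TO_MOESI[owner]
```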


The Five ACE Cache States

UniqueClean

This cache holds the only copy. Memory is up to date. Equivalent to E.

UniqueDirty

Only copy; memory is stale. Must write back before eviction. Equivalent to M.

SharedClean

Possibly shared with peers. Memory may be up-to-date or another peer may own a dirty copy. Equivalent to S.

SharedDirty

Shared with peers; this cache owns the dirty copy and is responsible for write-back. Equivalent to O.

Invalid

Line not present / dropped.

Key invariants

  • At most one cache can be Unique (UC or UD) for a given line.
  • At most one cache can be SharedDirty for a given line.
  • Multiple caches can be SharedClean simultaneously.
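These invariants are simple enough to check mechanically. The sketch below assumes a test bench can observe the state every cache holds for one line; the function name and error strings are made up for illustration. The second check is not a separate rule, it follows from the definition of Unique as "the only copy".

```python
from collections import Counter


def check_line_invariants(states):
    """Return a list of violated invariants for one cache line.

    states: one ACE state name ("UC", "UD", "SC", "SD", "I") per cache.
    """
    counts = Counter(states)
    unique = counts["UC"] + counts["UD"]
    shared = counts["SC"] + counts["SD"]
    errors = []
    if unique > 1:
        errors.append("more than one Unique holder")
    if unique == 1 and shared > 0:
        errors.append("Unique holder coexists with a Shared copy")
    if counts["SD"] > 1:
        errors.append("more than one SharedDirty holder")
    return errors
```

For example, `check_line_invariants(["SD", "SC", "SC", "I"])` returns an empty list, while `["UD", "SC"]` is flagged.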

What's "Dirty"?

A cache line is Dirty when it has been modified relative to main memory and the cache holding it is responsible for eventually writing it back. For any dirty line, that responsibility rests with exactly one cache: the one holding it in UD or SD.


ACE Channels — AXI Plus Three

  • On top of the five AXI channels (AR, R, AW, W, B), ACE adds three snoop channels:
    • AC — Snoop Address (interconnect → cache)
    • CR — Snoop Response (cache → interconnect)
    • CD — Snoop Data (cache → interconnect, optional)
  • When a coherent request arrives at the interconnect, its coherency logic sends an AC snoop to the peer caches (all of them, unless a snoop filter narrows the set); each peer answers on CR with hit/miss and state flags and, if it is handing over data, drives the line on CD.
  • Plus a DVM logical channel (Distributed Virtual Memory) — piggybacks on the snoop channels for TLB invalidation, I-cache maintenance, and barrier messages.
[Figure: an ACE master connected to the CCI / ACE interconnect by the five AXI channels (AR, R, AW, W, B) plus the three snoop channels AC (snoop address), CR (snoop response) and CD (snoop data); 5 + 3 = 8 channels in total.]

ACE Transaction Types — Read Side

Transaction          When used                                              Finishing state
ReadOnce             One-shot snoop read; don't cache                       (not cached)
ReadClean            Line fill expecting clean data only                    UC or SC
ReadNotSharedDirty   Line fill, willing to accept any state except SD       UC / UD / SC
ReadShared           Read expecting to share it                             UC / UD / SC / SD
ReadUnique           Read intending to modify; invalidates all peers        UC or UD
CleanUnique          Already hold SC/SD; upgrade to unique before writing   UC (UD if the line was SD)
MakeUnique           Overwrite entire line; no data transfer needed         UD

Why so many?

Each type tells the coherency logic exactly what ownership and data transfer are needed — lets the interconnect do the minimum work. A cache-line upgrade (SC → UC) needs no data movement at all, only invalidation messages.

CleanUnique vs MakeUnique: CleanUnique keeps the current data valid (reading before writing); MakeUnique discards current data (full-line overwrite, like memset). MakeUnique saves bus bandwidth on whole-line writes.
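One way to see why the split exists is to write down, per transaction, whether data moves and whether peers are invalidated, and then ask which one a cache would pick for a store. This is a deliberately simplified sketch: the attribute table paraphrases the slide, and `pick_store_txn` is a hypothetical decision function, not the logic of any real cache controller.

```python
from typing import NamedTuple


class ReadTxn(NamedTuple):
    moves_data: bool         # line contents travel to the requester
    invalidates_peers: bool  # peer copies are removed
    allocates: bool          # requester keeps the line afterwards


READ_SIDE = {
    "ReadOnce":           ReadTxn(True,  False, False),
    "ReadClean":          ReadTxn(True,  False, True),
    "ReadNotSharedDirty": ReadTxn(True,  False, True),
    "ReadShared":         ReadTxn(True,  False, True),
    "ReadUnique":         ReadTxn(True,  True,  True),
    "CleanUnique":        ReadTxn(False, True,  True),
    "MakeUnique":         ReadTxn(False, True,  True),
}


def pick_store_txn(have_copy: bool, full_line: bool) -> str:
    """Choose the cheapest transaction that grants write permission (simplified)."""
    if full_line:
        return "MakeUnique"   # old data is irrelevant, so nothing is fetched
    if have_copy:
        return "CleanUnique"  # data already local, only ownership is missing
    return "ReadUnique"       # need both the data and ownership
```

A partial store to a line already held in SC therefore costs invalidations only, which is the "upgrade without data movement" case described above.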

ACE Transaction Types — Write Side

Transaction       Purpose
WriteUnique       Non-cached write; snoops others to invalidate first
WriteLineUnique   Full-line non-cached write
WriteBack         Evict a dirty line (UD or SD) to memory
WriteClean        Write a dirty line to memory but keep it cached (UD → UC, SD → SC)
Evict             Inform coherency that a clean line was dropped
WriteEvict        Evict a UniqueClean line and write its data to a downstream cache

Why "Evict" (clean drop)?

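A snoop filter inside a coherent interconnect needs to know which caches currently hold which lines. If a CPU silently drops a clean line, the filter's tracking diverges from reality. A master attached to such an interconnect therefore issues an explicit Evict for clean drops so the filter can reclaim that entry.

WriteEvict covers a different case: a UniqueClean line is being evicted and its data is still worth keeping in a lower-level cache (a system-level cache, for example), so the eviction carries the data downstream. A modified line leaves the cache with WriteBack, which already writes the data and drops the line in one transaction.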

Snoop Transactions — from the Interconnect

  • When a ReadShared / ReadUnique / CleanUnique arrives at the interconnect, it broadcasts a snoop to every peer cache on their AC channel.
  • Snoop types mirror the request semantics:
    • ReadOnce / ReadClean / ReadNotSharedDirty / ReadShared / ReadUnique — peer must supply data if present, and transition its state accordingly.
    • CleanInvalid — peer must write back dirty data, then invalidate.
    • MakeInvalid — peer drops the line without writing back (used when the requester intends to overwrite the whole line).
  • Peer responds with a 5-bit CRRESP[4:0] on the CR channel, encoding data-transfer, error, pass-dirty, is-shared and was-unique.

CRRESP encoding

CRRESP[0] DataTransfer — CD active
CRRESP[1] Error         — snoop failed
CRRESP[2] PassDirty     — responder owns dirty
CRRESP[3] IsShared      — responder keeps shared
CRRESP[4] WasUnique     — responder was unique
A peer that's already Invalid for a line can respond with CRRESP=0 — "I have nothing." This is the common case; most snoops miss most peers.
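A small decoder makes the bit assignments concrete. It uses only the five bit positions listed above; the function name is an assumption for illustration.

```python
# Bit positions as listed in the CRRESP encoding above.
CRRESP_BITS = (
    (0, "DataTransfer"),
    (1, "Error"),
    (2, "PassDirty"),
    (3, "IsShared"),
    (4, "WasUnique"),
)


def decode_crresp(value: int) -> list:
    """Return the names of the flags set in a 5-bit CRRESP value."""
    value &= 0x1F  # only CRRESP[4:0] is meaningful
    return [name for bit, name in CRRESP_BITS if value & (1 << bit)]


miss = decode_crresp(0b00000)           # peer is Invalid: nothing to report
dirty_handover = decode_crresp(0b00101)
```

Here `dirty_handover` decodes to `["DataTransfer", "PassDirty"]`, the response a dirty owner gives when it supplies the line and passes on write-back responsibility.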

A Coherent Read — Full Flow

[Figure: CPU0 issues ReadShared while CPU1 holds the line in UD (Modified). Participants: CPU0 / cache (ACE master), the CCI-400 interconnect, CPU1 / cache (line state UD), and the DRAM controller.]

  1. AR: CPU0 sends ReadShared (ID=4, AxDOMAIN=Inner).
  2. AC: the interconnect snoops CPU1 with ReadShared.
  3. CR: CPU1 responds with PassDirty=1, DataTransfer=1.
  4. CD: CPU1 supplies the cache-line contents.
  5. R: the interconnect returns data + response to CPU0 (CPU0 now SD).
  A write-back to DRAM is optional.

CPU1 was UniqueDirty. By answering with PassDirty=1 it hands responsibility for the dirty data to the requester and keeps at most a SharedClean copy. CPU0 finishes in SharedDirty: it has the data and now owns the eventual write-back. No DRAM write was required.
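A dirty owner has two legal ways to answer a ReadShared snoop: keep ownership (PassDirty=0, it stays SharedDirty) or hand it over (PassDirty=1, it drops to SharedClean). The sketch below models both under loud simplifications: the peer always keeps a copy and always supplies data on a hit, and the function names and the `write_back_to_dram` switch are assumptions for illustration rather than spec terminology.

```python
def snoop_read_shared(peer_state: str, pass_dirty: bool = True):
    """Return (new_peer_state, response) for a ReadShared snoop (simplified)."""
    if peer_state == "I":
        return "I", dict(DataTransfer=0, PassDirty=0, IsShared=0, WasUnique=0)
    dirty = peer_state in ("UD", "SD")
    was_unique = peer_state in ("UC", "UD")
    hand_over = dirty and pass_dirty
    new_state = "SD" if (dirty and not hand_over) else "SC"
    resp = dict(DataTransfer=1, PassDirty=int(hand_over),
                IsShared=1, WasUnique=int(was_unique))
    return new_state, resp


def requester_state(responses, write_back_to_dram: bool = False) -> str:
    """State the requester ends in after collecting all snoop responses."""
    passed_dirty = any(r["PassDirty"] for r in responses)
    shared = any(r["IsShared"] for r in responses)
    if passed_dirty and not write_back_to_dram:
        return "SD"  # requester inherits write-back responsibility
    return "SC" if shared else "UC"


cpu1_after, resp = snoop_read_shared("UD", pass_dirty=True)
cpu0_after = requester_state([resp])
```

With `pass_dirty=True` the result is CPU1 in SC and CPU0 in SD; with `pass_dirty=False` CPU1 stays SD and CPU0 ends in SC. Either way the invariant of one dirty owner per line holds.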


AxDOMAIN & AxSNOOP

AxDOMAIN[1:0]

Controls the scope of snooping:

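  • 00 Non-shareable: no snoop needed
  • 01 Inner Shareable: snoop within the Inner domain (one cluster in this deck's diagrams; the actual boundary is fixed by the system design)
  • 10 Outer Shareable: snoop across the outer domain (multiple clusters, system cache)
  • 11 System: covers every agent in the system; used for Device and Non-cacheable accesses, which do not snoop caches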

AxSNOOP

Encodes the specific transaction type (ReadOnce / ReadShared / CleanUnique / etc.). ARSNOOP is 4 bits wide; AWSNOOP is 3 bits wide in AMBA 4 ACE. Combined with AxDOMAIN, the interconnect knows exactly who to snoop and how.
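To make the pairing concrete, here is a sketch of how an interconnect front end might decode the read address channel. The ARSNOOP values are quoted from memory of the AMBA 4 ACE encoding tables and cover only the read-side transactions discussed in this deck, so treat them as assumptions to verify against IHI 0022; `decode_ar` itself is a hypothetical helper.

```python
AXDOMAIN = {
    0b00: "Non-shareable",
    0b01: "Inner Shareable",
    0b10: "Outer Shareable",
    0b11: "System",
}

# Assumed ARSNOOP encodings for shareable-domain reads (check against the spec).
ARSNOOP_SHAREABLE = {
    0b0000: "ReadOnce",
    0b0001: "ReadShared",
    0b0010: "ReadClean",
    0b0011: "ReadNotSharedDirty",
    0b0111: "ReadUnique",
    0b1011: "CleanUnique",
    0b1100: "MakeUnique",
}


def decode_ar(ardomain: int, arsnoop: int):
    """Return (domain name, transaction name, whether peers are snooped)."""
    ardomain &= 0b11
    arsnoop &= 0b1111
    shareable = ardomain in (0b01, 0b10)
    if shareable:
        name = ARSNOOP_SHAREABLE.get(arsnoop, "not modelled")
    else:
        # The same all-zero code means a plain, unsnooped read outside shareable domains.
        name = "ReadNoSnoop" if arsnoop == 0 else "not modelled"
    return AXDOMAIN[ardomain], name, shareable
```

Note how the all-zero ARSNOOP value changes meaning with the domain: it is a coherent ReadOnce in a shareable domain and an ordinary unsnooped read otherwise.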

Mapping from MMU to ACE

Armv8 page tables specify Shareability (Inner / Outer / None) per-region. The CPU combines that with the memory-type to produce AxDOMAIN + AxCACHE — the ACE interconnect then enforces the coherency domain.

Performance hint: marking a big DMA buffer Non-shareable avoids the snoop that every access to it would otherwise trigger, at the price of explicit software cache maintenance around each transfer. Some OS drivers map DMA memory as Non-shareable for this reason.

ACE-Lite — I/O Coherency

  • Most peripherals aren't caches — they're producers/consumers of memory. They don't need to hold cache state, but they benefit enormously from the CPU's caches being coherent with their traffic.
  • ACE-Lite is a restricted ACE profile for such agents. Signals from the master side:
    • AXI + AxDOMAIN + AxSNOOP (so the interconnect can snoop CPU caches).
    • No AC / CR / CD channels on the master (it has no caches to snoop).
  • The master reads from / writes to memory, and the interconnect automatically snoops CPU caches to return fresh data or invalidate stale copies.
  • This is the coherent DMA / GPU / NIC / NPU interface.

One-way coherency

An ACE-Lite master is coherency-transparent: the CPU stays coherent with it, but the master itself has no caches. No peer snoops the master — it would find nothing.
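The one-way relationship can be shown with a toy model of a single CPU cache, memory, and an I/O master. Everything here is an illustrative assumption: the class and method names are invented, lines are treated as whole values, and partial-line writes (which would need a merge with dirty data) are not modelled.

```python
class TinySystem:
    """One CPU cache plus memory, with an ACE-Lite style I/O master (toy model)."""

    def __init__(self):
        self.memory = {}
        self.cpu_cache = {}  # address -> (ACE state, data)

    def cpu_write(self, addr, data):
        self.cpu_cache[addr] = ("UD", data)  # CPU now owns a dirty copy

    def cpu_read(self, addr):
        if addr not in self.cpu_cache:       # miss: fill from memory as UC
            self.cpu_cache[addr] = ("UC", self.memory.get(addr))
        return self.cpu_cache[addr][1]

    def io_read_once(self, addr):
        # The interconnect snoops the CPU on the master's behalf.
        state, data = self.cpu_cache.get(addr, ("I", None))
        if state in ("UD", "SD"):
            return data                      # freshest copy lives in the cache
        return self.memory.get(addr)

    def io_write_unique(self, addr, data):
        # Snoop-invalidate the CPU copy first, then update memory.
        self.cpu_cache.pop(addr, None)
        self.memory[addr] = data


system = TinySystem()
system.cpu_write(0x80, "from-cpu")
seen_by_dma = system.io_read_once(0x80)
system.io_write_unique(0x80, "from-dma")
seen_by_cpu = system.cpu_read(0x80)
```

The I/O master never holds state of its own, so nothing in the model ever snoops it; coherency work flows only toward the CPU cache.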

Cortex-A57 + Mali GPU over CCI-500 is the canonical example: the GPU uses ACE-Lite, so memory writes from CPUs are visible to the GPU without software cache maintenance.

DVM — Distributed Virtual Memory

  • Cache coherency is only half of SMP correctness — the other half is TLB coherency.
  • When one CPU invalidates a page-table entry (e.g. page removed, permission changed), every other CPU's TLBs must drop the corresponding entries before the OS can reuse the page.
  • DVM messages travel on the ACE snoop channels (AC, with a special SNOOP type) and carry:
    • TLB Invalidate (TLBI) — by VA, by ASID, by all
    • Branch Predictor Invalidate (BPI)
    • Instruction-cache Invalidate (ICI)
    • Sync — pipelined barrier
  • Every receiver must acknowledge; a DVM Sync waits for all acks before proceeding.
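The acknowledge-then-sync rule amounts to tracking an outstanding set per message. The sketch below is a behavioural model only; it ignores how messages are encoded on AC and how completion is signalled on the bus, and the class and method names are assumptions.

```python
class DvmFabric:
    """Tracks which participants still owe an acknowledgement (toy model)."""

    def __init__(self, participants):
        self.participants = set(participants)
        self.pending = {}   # message id -> participants that have not acked yet
        self._next_id = 0

    def broadcast(self, kind: str, sender: str) -> int:
        """Send a DVM message (e.g. "TLBI") to everyone except the sender."""
        msg_id = self._next_id
        self._next_id += 1
        self.pending[msg_id] = self.participants - {sender}
        return msg_id

    def ack(self, msg_id: int, participant: str) -> None:
        self.pending[msg_id].discard(participant)

    def sync_complete(self) -> bool:
        """A Sync may finish only when no earlier message is still outstanding."""
        return all(not waiting for waiting in self.pending.values())


fabric = DvmFabric(["cpu0", "cpu1", "cpu2"])
tlbi = fabric.broadcast("TLBI", sender="cpu0")
before = fabric.sync_complete()
fabric.ack(tlbi, "cpu1")
fabric.ack(tlbi, "cpu2")
after = fabric.sync_complete()
```

The slowest participant sets the completion time, which is why fan-out and acknowledgement latency matter so much on large systems.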

DVM + TLBI IS instructions

When software executes TLBI IS, ... on Armv8, the CPU issues a DVM TLBI to the interconnect, which broadcasts it to every Inner-Shareable participant. The subsequent DSB ISH waits on the DVM Sync acks.

DVM is often the bottleneck on workloads with heavy fork/mmap churn. Arm server chips put serious engineering into DVM fan-out so it doesn't serialise.

CCI-400 / CCI-500 / CCI-550

CCI       Year   ACE masters    ACE-Lite slaves
CCI-400   2011   2 (clusters)   3
CCI-500   2014   4              7, + snoop filter
CCI-550   2015   6              7, larger snoop filter

CCI = Cache Coherent Interconnect: Arm's off-the-shelf coherent interconnect IP, widely used in big.LITTLE and LITTLE-only SoCs between 2011 and 2016.

Why a snoop filter mattered

CCI-400 broadcast every snoop to every ACE master. With 2 clusters × 4 CPUs that is fine: each miss snoops one peer cluster. Scale the same broadcast scheme to 6–8 clusters and every miss, even to a line nobody else holds, would generate up to 7 snoops and swamp the ACE links.

CCI-500 added a snoop filter, an inclusive directory of which caches hold which lines, so only the caches that may hold a line are snooped. A 95%+ snoop-reduction rate was typical.

CCI-550 was the last generation of CCI. Beyond 8 coherent masters, the broadcast architecture became uneconomical — which motivated the CHI transition.

Snoop Filters — How They Work

  • The snoop filter is a directory: for each cache line currently cached anywhere in the system, track which masters hold it.
  • Typically implemented as a set-associative tag structure, inclusive of the caches it tracks.
  • On a coherent request:
    • Lookup the address in the filter.
    • If no master holds it → skip snoop entirely, go direct to memory.
    • If one or more masters hold it → snoop only those.
  • Filter capacity should be ≥ the sum of the tracked cache capacities. When the filter runs out of entries it has to evict one, and the line that entry tracked is back-invalidated in whichever caches hold it.
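The lookup, targeted snoop, and back-invalidate behaviour can be sketched as a small directory. To keep it short this model is fully associative with LRU replacement, whereas the slide describes a set-associative structure; the class and method names are assumptions.

```python
from collections import OrderedDict


class SnoopFilter:
    """Directory of line address -> set of master ids holding it (toy model)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = OrderedDict()  # oldest entry first

    def lookup(self, addr, requester):
        """Masters that must be snooped; an empty list means go straight to memory."""
        return sorted(self.entries.get(addr, set()) - {requester})

    def allocate(self, addr, master):
        """Record a new holder. Returns [(victim_addr, holders)] to back-invalidate."""
        back_invalidates = []
        if addr not in self.entries and len(self.entries) >= self.capacity:
            victim, holders = self.entries.popitem(last=False)
            back_invalidates.append((victim, sorted(holders)))
        self.entries.setdefault(addr, set()).add(master)
        self.entries.move_to_end(addr)
        return back_invalidates

    def evict(self, addr, master):
        """A master reported (for example via Evict) that it dropped the line."""
        holders = self.entries.get(addr)
        if holders is None:
            return
        holders.discard(master)
        if not holders:
            del self.entries[addr]


sf = SnoopFilter(capacity=2)
sf.allocate(0x000, master=0)
sf.allocate(0x040, master=1)
targets = sf.lookup(0x000, requester=1)   # only master 0 is snooped
untracked = sf.lookup(0x0C0, requester=1) # nobody holds it: no snoop
forced = sf.allocate(0x080, master=1)     # filter is full: back-invalidate
```

In this run the third allocation evicts the entry for `0x000`, forcing master 0 to give up a line it may still have wanted, which is the cost of an undersized filter.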

Filter as a cache

A too-small snoop filter becomes a bottleneck: on filter-miss-with-back-invalidate, a potentially-valid cache line gets force-invalidated just to free a filter entry, causing the CPU to miss later.

Arm's CCI-500/550 filters are typically sized 1.2–1.5× the aggregate cache capacity across the clusters they serve. CHI's Home Node filters (next deck) go further — they're often oversized and fully inclusive.

ACE Domains & Sharing Scope

[Figure: two Inner domains, cluster 0 (CPU A, B, C, D) and cluster 1 (CPU E, F), enclosed by the Outer Shareable domain, which also contains the CCI and the ACE-Lite masters (Mali GPU, DMA, NIC). Inner scope = one cluster; Outer = whole coherent domain.]
  • A CPU accessing a stack variable marks it Inner Shareable — only its own cluster needs to snoop.
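  • A kernel data structure shared across clusters is marked Outer Shareable, so every cluster is snooped, and accesses from ACE-Lite masters in that domain snoop the clusters too (the ACE-Lite masters themselves hold no cache state and are never snooped).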
  • DMA-mapped network buffer: typically Outer Shareable if the CPU and the NIC genuinely share it; Non-shareable if the driver manages cache maintenance manually.
  • This is why Armv8 OS kernels tune page table attributes carefully — the wrong shareability domain can be a 10% performance bug.

ACE5 — AMBA 5 Refresh

  • ACE5 (2017) brought AXI5's improvements to the coherent profile:
    • Atomic transactions — same offload as AXI5, but can participate in coherency (other caches get snooped accordingly).
    • Cache stashing: a master can hint that a cache line be installed directly into a chosen target cache, arriving there in UC or SC.
    • Cache Maintenance Operations (CMO) — system-wide cache clean/invalidate over the bus rather than CPU-executed loops.
    • MTE support — tag bits on every coherent transaction.
    • RME hooks — AxNSE for 4-way security states (Root/Realm/Secure/Non-Secure).

ACE5-Lite

The I/O-coherent profile got AMBA 5'd as well — same changes, but without the master-side snoop channels. Every modern NPU or GPU that wants to see CPU cache writes uses ACE5-Lite.

Common question: "Why both ACE5 and CHI?" Arm keeps ACE5 for master-side coherency at cluster scale, where a signal-level interface is the cheapest to wire. CHI is for the system-level fabric.

ACE's Scaling Limits

  • ACE's basic model is snoop-everyone. At 2 clusters → 1 peer to snoop. At 8 clusters → 7 peers per miss.
  • Snoop filters amortise but don't eliminate the fan-out.
  • The AC / CR / CD channels are signal-level buses — every master port has them physically. For 32+ masters, the wire count becomes unbearable on a reticle-sized die.
  • Beyond ~8 coherent masters, the protocol's point-to-point wiring assumption runs out.
  • Solution: move to a packet-based protocol where any number of nodes can share the same physical transport (mesh NoC). That protocol is CHI (next deck).
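The fan-out argument is plain arithmetic, shown here as a sketch. The function names are made up; the only inputs are the master count and, for the filtered case, how many other caches actually hold the line.

```python
def broadcast_snoops_per_miss(masters: int) -> int:
    """Snoop-everyone: every peer except the requester is snooped."""
    return max(masters - 1, 0)


def filtered_snoops_per_miss(masters: int, sharers: int) -> int:
    """With a snoop filter, only peers that hold the line are snooped."""
    return min(max(sharers, 0), broadcast_snoops_per_miss(masters))


# Peers snooped per miss as the number of coherent masters grows.
FANOUT = {n: broadcast_snoops_per_miss(n) for n in (2, 8, 32)}
```

`FANOUT` comes out as `{2: 1, 8: 7, 32: 31}`. A filter brings the common unshared-line case down to zero snoops, but a line that really is shared by many masters still costs one snoop per sharer, and every master port still needs its own AC/CR/CD wiring.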

The 2015 crisis

Arm's early server ambitions (Cortex-A57 16-core systems via CCI-550 + CCN-504) ran into ACE's scaling wall. The CCN-504 Cache Coherent Network interconnect used an internal CHI-like ring, but the customer-visible ACE ports were still signal-level. For Neoverse-class designs Arm pushed customers to CHI at the master interface too.

Today ACE is the right choice at cluster scale (≤8 CPUs); CHI is the right choice at system scale (32–128+ coherent nodes).

ACE in the Real World

ACE systems that shipped big

  • Samsung Exynos 5 Octa (2013) — Cortex-A15 + A7 big.LITTLE over CCI-400.
  • Qualcomm Snapdragon 820 (2016) — Kryo clusters over CCI-550.
  • MediaTek Helio X10/X20 (2015–16), octa- and deca-core designs via CCI-500.
  • Apple A6 → A11, bespoke interconnects with ACE interfaces (though Apple runs custom cores).
  • Amlogic, Allwinner, Rockchip: countless set-top / automotive SoCs.

ACE-Lite on the periphery

  • Mali GPUs (T6xx, T7xx, G-series) — ACE-Lite for coherent buffer sharing with the CPU.
  • Video decoders & ISPs — ACE-Lite so decoded frames are visible to CPU caches.
  • Cortex-R coherent peers in R82 (2020) — real-time cores joining an A-profile coherency domain.
  • Arm SMMU MMU-600/MMU-700 — participates in DVM on behalf of I/O devices.

Minimal ACE-Lite Master — Almost Free

The cheapest way to be "coherent"

  • An ACE-Lite master has no snoop channels. Physically it's just an AXI4 master with a few extra signals (AxDOMAIN, AxSNOOP, AxBAR).
  • If all you need is I/O coherency ("CPU cache sees my writes"), you can bolt ACE-Lite onto an existing AXI master by driving the new attributes to sensible constants — zero extra logic.
  • Interconnect (CCI) sees those constants and snoops the CPU on behalf of the master.
This is why almost every modern DMA engine gained "coherency" in one RTL afternoon — the work was on the interconnect side, not on the master.
// Thin ACE-Lite wrapper around an AXI4 master.
// No snoop channels exist on the master side.
module acel_wrap #(parameter ID_W = 4) (
  // existing AXI4 port (connect to your master IP)
  output logic [ID_W-1:0] ARID,
  output logic [31:0]     ARADDR,
  output logic [7:0]      ARLEN,
  output logic [2:0]      ARSIZE,
  output logic [1:0]      ARBURST,
  output logic            ARVALID,
  input  logic            ARREADY,
  // ... (the rest of AR/R/AW/W/B as usual)

  // ACE-Lite extra attributes: pure combinational tie-offs
  output logic [1:0]      ARDOMAIN,
  output logic [3:0]      ARSNOOP,
  output logic [1:0]      ARBAR,
  output logic [1:0]      AWDOMAIN,
  output logic [2:0]      AWSNOOP,
  output logic [1:0]      AWBAR
);
  // The AXI4 channel outputs above are driven by the existing master
  // logic, which is not shown in this sketch.

  // I/O-coherent DMA targeting shared memory:
  //   Outer Shareable, ReadOnce / WriteUnique, no barriers.
  // Assumes the master drives AxCACHE as Normal cacheable; a shareable
  // domain is not a legal pairing with Device / Non-cacheable accesses.
  assign ARDOMAIN = 2'b10;    // Outer Shareable
  assign ARSNOOP  = 4'b0000;  // ReadOnce
  assign ARBAR    = 2'b00;    // No barrier

  assign AWDOMAIN = 2'b10;    // Outer Shareable
  assign AWSNOOP  = 3'b000;   // WriteUnique (3'b001 would be WriteLineUnique)
  assign AWBAR    = 2'b00;    // No barrier
endmodule

Six assigns. That's it. Total new gates for coherent DMA = 0 — the CCI / CMN does all the work.


Interview-Ready Takeaways

  • "What are the five ACE cache states?" → UC, UD, SC, SD, I. They map to E, M, S, O, I in MOESI.
  • "What's the difference between ReadShared and ReadUnique?" → ReadShared accepts shared copies elsewhere; ReadUnique invalidates all peers so the requester can modify.
  • "What are the three snoop channels?" → AC (snoop address, interconnect → cache), CR (snoop response), CD (snoop data).
  • "Why ACE-Lite for GPUs?" → GPUs are producers/consumers of memory, not holders of cache state. ACE-Lite lets them enter the coherency domain one-way, without exposing unused snoop channels.
  • "What is DVM?" → Distributed Virtual Memory — TLB/I-cache maintenance messages piggybacked on the ACE snoop channels so SMP software doesn't need cross-CPU IPIs for every TLBI.
  • "Why does a snoop filter matter?" → It turns a broadcast snoop into a targeted one. Broadcast cost grows with every added master (7 snoops per miss at 8 masters), while a filtered miss to an unshared line sends none.
  • "Why doesn't ACE scale to 32 coherent masters?" → Point-to-point signal-level wiring; the AC/CR/CD channels must physically reach every master port. Solved by moving to packet-based CHI.
  • "What did ACE5 add?" → Atomics, cache stashing, CMO, MTE tags, RME hooks.

References

Arm Ltd. AMBA AXI and ACE Protocol Specification (IHI 0022). Contains the full ACE / ACE-Lite / ACE5 definitions.
Arm Ltd. CoreLink CCI-400 Cache Coherent Interconnect Technical Reference Manual.
Arm Ltd. CCI-500 / CCI-550 Technical Reference Manuals. Snoop filter design and sizing.
Arm Ltd. Cortex-A15 MPCore TRM. Reference ACE master definition.
Hennessy & Patterson. Computer Architecture: A Quantitative Approach, 6th ed. Chapter on multiprocessor cache coherence (MOESI variants).
Sorin, D., Hill, M., Wood, D. A Primer on Memory Consistency and Cache Coherence (Synthesis Lectures, 2011). The canonical textbook on directory vs snooping protocols.
Sridharan, S. et al. ACE Protocol and Coherency. Arm Tech Summit presentations (2013–2016).
Martonosi, M. et al. Papers on DVM scalability in Arm SoCs (ASPLOS, MICRO).
Wikipedia. "MOESI protocol", "Cache coherence". Well-sourced cross-references.

Educational use.