ARM CORTEX-A · PRESENTATION 03

Memory System — VMSA, Caches & Ordering

Translation · TLBs · Cache hierarchy · Weak ordering · LSE atomics
VMSAv8-64 · 4K/16K/64K granules · ASID/VMID · PoU/PoC · DMB/DSB/ISB · LDAR/STLR · LSE
02

Why Cortex-A Memory is Different

  • Cortex-M uses an MPU (region-based, no translation). Cortex-A uses a full MMU with paged virtual memory — the VMSA (Virtual Memory System Architecture).
  • Every memory access goes through Stage-1 (OS) translation, and possibly Stage-2 (hypervisor) translation — up to 24 page-table reads for a single access in the worst case.
  • Memory is weakly ordered — loads and stores can be reordered freely unless explicit barriers or release/acquire operations are used.
  • The memory system is also a multi-core story: caches must be coherent across cores + GPU + NPU + DMA engines, using AXI/ACE or CHI.

Two memory models, same CPU

Armv8-A is weakly ordered by default, with release-consistency (RCsc) LDAR/STLR in the base architecture and, from v8.3, weaker RCpc acquire loads (LDAPR). Java volatile, C++ seq_cst, and Linux kernel RCU all compile down to these primitives.
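
In instructions, that spectrum looks like this — a minimal sketch; the LDAPR form assumes FEAT_LRCPC (Armv8.3):

// acquire/release load & store flavours (sketch)
ldar   w0, [x1]            // acquire load, RCsc (base Armv8-A)
ldapr  w0, [x1]            // acquire load, RCpc — weaker, cheaper (Armv8.3+)
stlr   w2, [x1]            // release store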

Stage-1 + Stage-2 is a feature

Dual-stage translation is how KVM/Android pKVM gives guests their own virtual address spaces cheaply — the guest-physical → host-physical mapping is Stage-2, invisible to the guest OS.

03

VMSAv8-64 — Three Granule Sizes

Granule | Typical use                        | Levels for 48-bit VA | Block sizes available
4 KB    | Linux default; most servers/phones | 4 (L0→L3)            | 4 KB page, 2 MB block, 1 GB block
16 KB   | Apple iOS/macOS default            | 4 (L0→L3)            | 16 KB page, 32 MB block
64 KB   | Some server workloads, huge pages  | 3 (L1→L3)            | 64 KB page, 512 MB block

Each translation-table level indexes 9 bits of VA for the 4 KB granule — Levels 0..3 use VA[47:39], [38:30], [29:21], [20:12]. The 16 KB granule indexes 11 bits per level; 64 KB indexes 13 bits.
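
As a sketch, the index extraction for the 4 KB granule (register choices are illustrative):

// pull the four table indices out of the VA in x0
ubfx  x1, x0, #39, #9      // L0 index = VA[47:39]
ubfx  x2, x0, #30, #9      // L1 index = VA[38:30]
ubfx  x3, x0, #21, #9      // L2 index = VA[29:21]
ubfx  x4, x0, #12, #9      // L3 index = VA[20:12]
and   x5, x0, #0xFFF       // page offset = VA[11:0]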

Block descriptors = huge pages

Instead of walking all the way down to a 4 KB page, an L1 descriptor can directly map a 1 GB block. Linux uses this for hugetlb and for the 1:1 kernel map.

Armv8.2-A: 52-bit VA / PA

LVA (Large Virtual Address) extends VA to 52 bits — initially with the 64 KB granule; Armv8.7 LPA2 brings 52-bit to the 4 KB/16 KB granules via an extra level (level -1). LPA extends PA to 52 bits, carrying the extra bits in high descriptor bits.

04

VA Layout — TTBR0 vs TTBR1

64-bit VA split (Armv8-A, Linux):
  0xFFFF_0000_0000_0000 – 0xFFFF_FFFF_FFFF_FFFF — kernel half, TTBR1_EL1
  (canonical hole between the halves — faults on access)
  0x0000_0000_0000_0000 – 0x0000_FFFF_FFFF_FFFF — user half, TTBR0_EL1 (per process)
VA[63:48] must be all-0 or all-1 ("canonical")
  • TTBR0_EL1 — user page-table root. Changed on every context switch (entries tagged by ASID).
  • TTBR1_EL1 — kernel page-table root. Stays the same for all processes.
  • Effective VA width is controlled by TCR_EL1.T0SZ / T1SZ — T0SZ=16 gives a 48-bit user VA; T0SZ=12 gives 52-bit (LVA).
  • If VA[63:48] is neither all-0 nor all-1, the CPU faults with a translation fault — the "canonical address" check.
  • TBI (Top Byte Ignore) — allows bits [63:56] to carry a tag (used by MTE and HWASAN) without faulting.
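
With TBI enabled, software still has to strip the tag before comparing or doing arithmetic on pointers — a one-line sketch:

// clear a TBI/MTE tag in bits [63:56] (sketch)
and   x1, x0, #0x00FFFFFFFFFFFFFF   // x1 = untagged pointer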
05

The Page-Table Walker

  • On a TLB miss the Hardware Table Walker (HTW) walks the in-memory page tables without any software help.
  • Each descriptor is 8 bytes with a type (block / table / invalid) and PA + attribute bits.
  • The walk descends level by level: L0 → L1 → L2 → L3. Each step issues a memory read to the PA of the next-level table.
  • A walk can itself miss in the cache and go to DRAM — expensive. Modern cores prefetch walks into a dedicated walker cache (sometimes called an "intermediate TLB" or "paging-structure cache").
  • On Stage-1+Stage-2, each Stage-1 walk-step is itself Stage-2-translated — worst case 24 reads per VA translation.
// Simplified 4 KB, 48-bit VA walk (4 levels)

VA[47:39]  = L0 index (PGD)
VA[38:30]  = L1 index (PUD)
VA[29:21]  = L2 index (PMD)
VA[20:12]  = L3 index (PTE)
VA[11:0]   = page offset

PTE descriptor layout (simplified):
  [0]     valid
  [1]     table/page (1) or block (0) — at L3 this bit must be 1
  [11:2]  lower attrs (AP, SH, AttrIndx, AF, nG)
  [47:12] output address (PA bits)
  [63:50] upper attrs (GP, DBM, Contig, PXN, UXN, PBHA)

// AF = Access Flag. Software-set via a fault on older cores;
// Armv8.1 (FEAT_HAFDBS) adds hardware AF/dirty update.
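
To make the layout concrete, a hedged sketch of decoding an L3 page descriptor (registers and the fault label are illustrative):

// decode the L3 descriptor in x0 (sketch)
tbz   x0, #0, not_mapped            // bit[0]=0 → translation fault
and   x1, x0, #0x0000FFFFFFFFF000   // output address, PA[47:12]
ubfx  x2, x0, #2, #3                // AttrIndx → MAIR_EL1 slot
ubfx  x3, x0, #6, #2                // AP[2:1] access permissions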
06

ASID & VMID — Tagged TLB Entries

  • ASID (Address Space ID) — 8- or 16-bit tag in every TLB entry for Stage-1 user translations. Comes from TTBR0_EL1[63:48].
  • Lets the CPU switch user page tables without flushing the TLB — entries for the old process stay, tagged with its ASID.
  • Kernel entries are global (nG=0) — not tagged by ASID, so they're visible from every process without duplication.
  • VMID (VM ID) — similar concept at Stage-2, tags hypervisor guest translations. 8- or 16-bit. Set in VTTBR_EL2[63:48].
  • Linux rolls over its ASID allocator when all ASIDs are in use; rollover triggers broadcast invalidation (TLBI ASIDE1IS) for reassigned ASIDs.

Why tags matter

Without tags, every context switch would flush the L1 TLB (~48-128 entries), and refilling it costs hundreds of cycles. ASIDs make a context switch near-free for the TLB.

// TLBI maintenance ops (IS suffix = Inner Shareable broadcast)

tlbi  vmalle1is          // all EL1 stage-1 entries, every ASID
tlbi  aside1is, x0       // all entries matching the ASID in x0
tlbi  vae1is, x1         // single VA + ASID (encoded in x1)
tlbi  vmalls12e1is       // all Stage-1+2 entries for the current VMID
dsb   ish                // wait for completion across the domain
isb                      // resynchronise instruction fetch
07

Memory Types & Attributes

Type          | Cacheable         | Reorder?                       | Speculation? | Use
Normal        | Yes (with policy) | Yes (weak)                     | Yes          | DRAM, cacheable regions
Device-GRE    | No                | Gathered, reordered, early-ack | No           | High-throughput I/O (graphics buffers)
Device-nGRE   | No                | Reordered, early-ack           | No           | Less permissive I/O
Device-nGnRE  | No                | Early-ack only                 | No           | Typical peripherals (GIC, UART)
Device-nGnRnE | No                | None                           | No           | Strictly ordered — legacy "strongly-ordered"

"nG" = non-Gathering, "nR" = non-Reordering, "nE" = no Early write-ack. "Normal" memory also has an Inner & Outer cache policy (WB-WA, WT, NC). These are all encoded in an 8-bit MAIR_ELx slot, and the PTE picks a slot by 3-bit AttrIndx.

MAIR indirection means the attribute table can be rewritten once and all page-table entries immediately pick up the new meaning — very useful for quickly switching between write-through and write-back without touching every PTE.
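
A minimal bring-up sketch with just two slots (0xFF = Normal WB-WA, 0x00 = Device-nGnRnE, per the Arm ARM encodings):

// program MAIR_EL1 slot 0 and slot 1 (sketch)
movz  x0, #0x00FF          // byte 0 → slot 0 = Normal WB-WA; byte 1 → slot 1 = Device-nGnRnE
msr   mair_el1, x0
isb                        // subsequent translations see the new attribute table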

08

Cache Hierarchy in a Modern Cortex-A SoC

Modern Cortex-A cache hierarchy (top to bottom): per-core L1-I / L1-D (32-64 KB) → per-core L2 (256 KB – 3 MB) → L3 in the DynamIQ Shared Unit (DSU) → SLC, the System Level Cache on the CMN mesh → DRAM (LPDDR5X, DDR5, HBM).
  • L1-I / L1-D — split Harvard, per-core. Usually 32-64 KB, 2-4-way. VIPT (virtually-indexed, physically-tagged) — avoids synonym problems.
  • L2 — per core in modern cores (A76 onwards). 256 KB-3 MB. Inclusive or mostly-inclusive of L1.
  • L3 (DSU) — shared by the whole DynamIQ cluster. Up to 16 MB. Managed by the DSU.
  • SLC (System Level Cache) — attached to the CMN mesh/NoC. Sits on the path from cluster to DRAM. Shared with GPU, NPU, display.
  • Line size is 64 bytes across the whole A-profile family.
09

PoU vs PoC vs PoP

  • Point of Unification (PoU) — the level where I-cache and D-cache see the same data. Usually L2 or L3. Required for self-modifying code: after writing to memory, clean D-cache to PoU + invalidate I-cache to PoU.
  • Point of Coherency (PoC) — the level where all coherent masters see the same data. Usually the SLC or just before DRAM. Required for DMA & non-coherent masters.
  • Point of Persistence (PoP) — Armv8.2 addition for non-volatile memory (NVDIMM, battery-backed DRAM). Not widely used on mobile.
// JIT compiler flushes after writing code
// (write-then-execute pattern)

bl    generate_code        // builds code into [x0 .. x0+len]
add   x3, x0, x1           // assume generate_code returns len in x1 → x3 = end
mov   x2, x0               // x2 = cursor

1:  dc    cvau, x2         // clean D-line to PoU
    add   x2, x2, #64
    cmp   x2, x3
    b.lo  1b
dsb   ish                  // cleans visible to the I-side

mov   x2, x0
2:  ic    ivau, x2         // invalidate I-line to PoU (by VA, whole range)
    add   x2, x2, #64
    cmp   x2, x3
    b.lo  2b
dsb   ish                  // invalidates complete
isb                        // restart fetch

br    x0                   // now safe to execute

This sequence is what every JIT (V8, JSC, JVM) does on AArch64. Newer cores can shortcut it via CTR_EL0.DIC/IDC — if set, the D-clean or I-invalidate steps are redundant.
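
A sketch of that runtime check (IDC is CTR_EL0 bit 28, DIC bit 29; labels are illustrative):

// skip redundant maintenance when the core allows it (sketch)
mrs   x4, ctr_el0
tbnz  x4, #28, skip_dc     // IDC=1: no D-clean to PoU required
tbnz  x4, #29, skip_ic     // DIC=1: no I-invalidate to PoU required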

10

Cache Maintenance Instructions

Instruction  | What it does                         | Typical use
IC IALLU     | Invalidate entire I-cache (local PE) | After writing code pages
IC IVAU, Xt  | Invalidate I-cache by VA to PoU      | JIT flush (see prev slide)
DC IVAC, Xt  | Invalidate D-cache by VA to PoC      | Before DMA from device → memory
DC CVAC, Xt  | Clean D-cache by VA to PoC           | Before DMA from memory → device
DC CIVAC, Xt | Clean & invalidate by VA to PoC      | Bidirectional DMA buffers
DC CVAU, Xt  | Clean by VA to PoU                   | JIT flush, pairs with IC IVAU
DC ZVA, Xt   | Zero a cache line without a fill     | memset fast path (replaces an STR #0 loop)
DC CVAP, Xt  | Clean by VA to PoP                   | NVDIMM flush (Armv8.2)

Set/way operations (DC ISW / CSW / CISW) still exist for bring-up but are discouraged — they are not broadcast to other PEs. For runtime flushing, always use by-VA ops.
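
For example, a by-VA sketch for a bidirectional DMA buffer (bounds and the 64-byte line size are assumed):

// clean & invalidate [x0 .. x1) to PoC around bidirectional DMA (sketch)
1:  dc    civac, x0        // clean & invalidate one line to PoC
    add   x0, x0, #64
    cmp   x0, x1
    b.lo  1b
dsb   sy                   // complete before signalling the device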

11

Weak Memory Ordering — the Arm Model

  • Arm A-profile is weakly ordered. The only ordering guaranteed is:
    • Same-address load→load (coherence)
    • Same-address store→load (program order is respected for a single location)
    • Dependency ordering (address/data dep preserves program order for the dependent access)
  • Everything else — load→store, store→store, unrelated load→load — can be reordered by hardware.
  • Programmer recovers ordering with:
    • BarriersDMB / DSB / ISB
    • Release/acquireLDAR / STLR
    • LSE atomics (Armv8.1) with acquire/release semantics

"Load buffering" litmus test

Two cores: P0 does LDR w1, [A]; STR w2, [B]. P1 does LDR w3, [B]; STR w4, [A] (both stores write 1). Can both loads return 1, each observing the other core's store? On x86: no. On Arm: yes — loads can be reordered with later stores. See the sketch below.
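
Written out (registers illustrative; A and B start at zero):

// load-buffering (LB) litmus, sketch
// P0:                         P1:
//   ldr  w1, [A]                ldr  w3, [B]
//   str  w2, [B]   (w2=1)       str  w4, [A]   (w4=1)
// Outcome w1==1 && w3==1: allowed on Arm, forbidden on x86 (TSO).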

Use release/acquire, not barriers

Idiomatic AArch64 code uses LDAR/STLR — the acquire/release semantics pair with the CPU's own store buffer, no separate fence needed. Barriers become rare in modern code.

12

DMB / DSB / ISB — The Barrier Zoo

Barrier   | Scope              | Semantics                                        | Typical use
DMB ISH   | Inner Shareable    | Orders all loads/stores across the ISH domain    | Lock fence, spinlock release
DMB ISHLD | ISH, load barrier  | Orders earlier loads before later loads & stores | Read-side of RCU
DMB ISHST | ISH, store barrier | Orders earlier stores before later stores        | Write-combining flush
DSB ISH   | Inner Shareable    | All prior accesses & maintenance ops complete    | After TLBI / IC / DC
DSB OSH   | Outer Shareable    | Also covers devices in the outer domain          | DMA start synchronisation
DSB SY    | Full system        | Full system barrier                              | Boot, rare
ISB       | Local PE           | Context synchronisation — flushes the pipeline   | After system-register writes, after cache maintenance

Rule of thumb: DMB orders memory accesses; DSB also waits for them (and for maintenance ops) to complete; ISB flushes the pipeline so subsequent fetches see new state. For plain data ordering, most barrier uses can be replaced by LDAR/STLR/CAS — maintenance sequences still need DSB/ISB.

13

LSE Atomics (Armv8.1-A)

  • Before LSE, atomic ops were built from an LDXR / STXR loop — "Load-Linked / Store-Conditional" (see the sketch at the end of this slide). Under contention it can livelock.
  • LSE — Large System Extension — adds single-instruction atomics that scale to many-core.
  • Each takes acquire (A) / release (L) / acquire+release (AL) / plain variants — so ordering is baked in.
  • Under the hood, big SMP systems implement LSE as a far atomic sent to the Home Node (HN-F) in CHI — the cache-line's "owner" executes the RMW without the requester needing to own the line.
  • The Linux kernel patches LSE in at boot when present; for user code, -moutline-atomics emits a runtime picker.
// LSE atomics — all AArch64-only, Armv8.1-A+

cas    w0, w1, [x2]        // compare-and-swap
casal  w0, w1, [x2]        // CAS acquire+release
swp    w0, w1, [x2]        // SWP: atomic exchange

ldadd  w0, w1, [x2]        // atomic fetch-add
ldaxr / ldxr              // legacy exclusive loads (LL/SC — sketch below)

// Idiomatic acquire-semantics spinlock using CAS
acquire:
    mov   w3, #1
1:  mov   w1, wzr
    casa  w1, w3, [x0]     // CAS acquire: expect 0, set 1
    cbnz  w1, 1b           // retry until free
    ret

release:
    stlr  wzr, [x0]        // release-store clears
    ret
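
For contrast, the pre-LSE LDXR/STXR loop for the same fetch-add — the contended-livelock shape mentioned above (sketch):

// atomic fetch-add via exclusives, pre-Armv8.1 (sketch)
1:  ldxr  w1, [x2]         // load-exclusive current value
    add   w3, w1, w0       // compute new value
    stxr  w4, w3, [x2]     // store-exclusive; w4 = 0 on success
    cbnz  w4, 1b           // exclusive lost — retry (can livelock under contention)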
14

Producer / Consumer — Release & Acquire in Practice

// Classic "publish data, then flag" pattern

// Producer (core A):
//   data is filled in, then published
str   x0, [data_ptr]       // 1: write data
stlr  w1, [flag]           // 2: release-store flag=1

// Consumer (core B):
//   spin on flag, then read data
1:  ldar  w2, [flag]       // acquire-load
    cbz   w2, 1b           // loop until non-zero
ldr   x3, [data_ptr]       // safe: sees the pre-STLR write
  • The STLR in step 2 has release semantics: no prior store by this core may be reordered past it.
  • The LDAR in step 1 of the consumer has acquire semantics: no subsequent load may be reordered before it.
  • Together, the STLR-LDAR pair creates a happens-before edge — the consumer is guaranteed to see the producer's earlier STR x0, [data_ptr].
  • Without release/acquire this program is racy on Arm — the consumer can read stale data. On x86 it happens to work because x86 is TSO.
  • C++ memory_order_release / memory_order_acquire map exactly to STLR / LDAR on AArch64.
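
For contrast, the same pattern with explicit barriers — correct, but exactly what release/acquire makes unnecessary (sketch):

// barrier-based producer/consumer (sketch)
// Producer:
str   x0, [data_ptr]       // write data
dmb   ish                  // order data before flag
str   w1, [flag]           // plain flag store

// Consumer:
1:  ldr   w2, [flag]
    cbz   w2, 1b
dmb   ish                  // order flag read before data read
ldr   x3, [data_ptr]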
15

Prefetch & PRFM

  • PRFM — hint to the caches; no architectural effect.
  • Encoded target × type × locality:
    • Target: PLD (load) / PLI (instruction) / PST (store)
    • Type: L1 / L2 / L3
    • Locality: KEEP / STRM (streaming — hint cache it briefly)
  • Hardware prefetchers also run in parallel — strided, stream, and increasingly Data-Dependent prefetchers in Cortex-X4+.
  • In some modern workloads (DB hashing, graph traversal) explicit PRFM can still win by 10-20% even with a good HW prefetcher.
// Linked-list walk with manual prefetch
// Classic cache-miss-hiding pattern

walk:
    stp   x29, x30, [sp, #-32]!   // save FP/LR
    str   x19, [sp, #16]          // save callee-saved scratch
1:  cbz   x0, 2f                  // end of list?
    ldr   x19, [x0, #NEXT]        // link to the next node
    prfm  pldl1keep, [x19]        // prefetch next node (a NULL prefetch is a harmless hint)
    bl    process_node            // process node in x0; x19 survives the call
    mov   x0, x19                 // advance
    b     1b
2:  ldr   x19, [sp, #16]
    ldp   x29, x30, [sp], #32
    ret

DC ZVA is effectively a "prefetch-for-write" — it zeroes a cache line in L1 without a fill from DRAM. Used in memset / calloc fast-paths.
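
A sketch of the zeroing loop (assumes a 64-byte ZVA block; real code reads DCZID_EL0 and checks its DZP bit first):

// zero [x0 .. x1) with DC ZVA (sketch)
1:  dc    zva, x0          // zero a full line in cache, no DRAM fill
    add   x0, x0, #64
    cmp   x0, x1
    b.lo  1b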

16

Key takeaways

  • Weak memory model is the single most-tripped-up topic when reasoning about Arm SMP — know STLR / LDAR / LSE.
  • PoU vs PoC is the JIT & DMA test. Memorise the "clean to PoU, invalidate I to PoU" sequence.
  • Know what ASID does and why nG=0 kernel entries avoid ASID tagging.
  • TTBR0 vs TTBR1 split lets Linux keep a single kernel page-table across all processes.
17

Lessons

  • "Why tagged TLB with ASID?" → avoids flushing user entries on context switch; cut switch cost from hundreds of cycles to near zero.
  • "What's the difference between PoU and PoC?" → PoU is where I and D caches see the same data (for self-modifying code); PoC is where everyone, including DMA masters, sees the same data.
  • "Why does Linux use 4 KB granule mostly?" → x86 compat, small-page granularity for fine-grained mmap. Apple chose 16 KB for better TLB coverage and fewer page-table levels.
  • "Why LSE atomics?" → LDXR/STXR livelocks under heavy contention; LSE sends a single RMW to the cache-line's home node, scaling to 64+ cores.
  • "Explain a spinlock on AArch64" → CAS acquire to take, STLR to release, WFE while parked; avoids contended LL/SC livelock.
  • "When is a DSB needed?" → after TLBI or IC/DC maintenance, to ensure it's globally observable before the ISB; rarely needed for plain data ordering (use release/acquire).
  • "What is TBI?" → Top Byte Ignore — CPU ignores bits [63:56] of a VA on access. Enables MTE tags and HWASAN.
18

References

Arm Ltd. — DDI 0487, Arm Architecture Reference Manual (VMSA, ordering, and cache chapters)
Arm Ltd. — Learn the architecture: AArch64 memory management (free, well-written walkthrough)
Arm Ltd. — Cache maintenance application note (by-VA vs set/way operations)
McKenney, Paul E. — Is Parallel Programming Hard, And, If So, What Can You Do About It? (RCU author; covers the Arm memory model)
Alglave, Maranget, Tautschnig — "Herding Cats: Modelling, Simulation, Testing, and Data Mining for Weak Memory" (ACM TOPLAS 2014) — Arm memory-model formalisation
Linux kernel — Documentation/arm64/memory.rst (layout & ASID notes)
Linux kernel — arch/arm64/include/asm/atomic_lse.h (LSE atomic implementations)
Preshing, Jeff — preshing.com (memory-ordering intuition pieces)

Presentation built with Reveal.js 4.6 · Playfair Display + DM Sans + JetBrains Mono
Educational use.