ARM CORTEX-A · PRESENTATION 03

Memory System — VMSA, Caches & Ordering

Translation · TLBs · Cache hierarchy · Weak ordering · LSE atomics
VMSAv8-64 · 4K/16K/64K granules · ASID/VMID · PoU/PoC · DMB/DSB/ISB · LDAR/STLR · LSE
02

Why Cortex-A Memory is Different

  • Cortex-M uses an MPU (region-based, no translation). Cortex-A uses a full MMU with paged virtual memory — the VMSA (Virtual Memory System Architecture).
  • Every memory access goes through Stage-1 (OS) translation, and possibly Stage-2 (hypervisor) translation — up to 24 page-table reads for a single access in the worst case.
  • Memory is weakly ordered — loads and stores can be reordered freely unless explicit barriers or release/acquire operations are used.
  • The memory system is also a multi-core story: caches must be coherent across cores + GPU + NPU + DMA engines, using AXI/ACE or CHI.

Two memory models, same CPU

Armv8-A is weakly ordered by default, with release-consistency (RCsc) LDAR/STLR in the base architecture and, from v8.3, weaker RCpc acquire loads (LDAPR). Java volatile, C++ seq_cst, and Linux kernel RCU all compile down to these primitives.
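
In instructions, that spectrum looks like this — a minimal sketch; the LDAPR form assumes FEAT_LRCPC (Armv8.3):

// acquire/release load & store flavours (sketch)
ldar   w0, [x1]            // acquire load, RCsc (base Armv8-A)
ldapr  w0, [x1]            // acquire load, RCpc — weaker, cheaper (Armv8.3+)
stlr   w2, [x1]            // release store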

Stage-1 + Stage-2 is a feature

Dual-stage translation is how KVM/Android pKVM gives guests their own virtual address spaces cheaply — the guest-physical → host-physical mapping is Stage-2, invisible to the guest OS.

03

VMSAv8-64 — Three Granule Sizes

Granule | Typical use                        | Levels for 48-bit VA | Block sizes available
4 KB    | Linux default; most servers/phones | 4 (L0→L3)            | 4 KB page, 2 MB block, 1 GB block
16 KB   | Apple iOS/macOS default            | 4 (L0→L3)            | 16 KB page, 32 MB block
64 KB   | Some server workloads, huge pages  | 3 (L1→L3)            | 64 KB page, 512 MB block

Each translation-table level indexes 9 bits of VA for the 4 KB granule — Levels 0..3 use VA[47:39], [38:30], [29:21], [20:12]. The 16 KB granule indexes 11 bits per level; 64 KB indexes 13 bits.
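
As a sketch, the index extraction for the 4 KB granule (register choices are illustrative):

// pull the four table indices out of the VA in x0
ubfx  x1, x0, #39, #9      // L0 index = VA[47:39]
ubfx  x2, x0, #30, #9      // L1 index = VA[38:30]
ubfx  x3, x0, #21, #9      // L2 index = VA[29:21]
ubfx  x4, x0, #12, #9      // L3 index = VA[20:12]
and   x5, x0, #0xFFF       // page offset = VA[11:0]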

Block descriptors = huge pages

Instead of walking all the way down to a 4 KB page, an L1 descriptor can directly map a 1 GB block. Linux uses this for hugetlb and for the 1:1 kernel map.

Armv8.2-A: 52-bit VA / PA

LVA (Large Virtual Address) extends VA to 52 bits — initially with the 64 KB granule; Armv8.7 LPA2 brings 52-bit to the 4 KB/16 KB granules via an extra level (level -1). LPA extends PA to 52 bits, carrying the extra bits in high descriptor bits.

04

VA Layout — TTBR0 vs TTBR1

64-bit VA split (Armv8-A, Linux):
  0xFFFF_0000_0000_0000 – 0xFFFF_FFFF_FFFF_FFFF — kernel half, TTBR1_EL1
  (canonical hole between the halves — faults on access)
  0x0000_0000_0000_0000 – 0x0000_FFFF_FFFF_FFFF — user half, TTBR0_EL1 (per process)
VA[63:48] must be all-0 or all-1 ("canonical")
  • TTBR0_EL1 — user page-table root. Changed on every context switch (entries tagged by ASID).
  • TTBR1_EL1 — kernel page-table root. Stays the same for all processes.
  • Effective VA width is controlled by TCR_EL1.T0SZ / T1SZ — T0SZ=16 gives a 48-bit user VA; T0SZ=12 gives 52-bit (LVA).
  • If VA[63:48] is neither all-0 nor all-1, the CPU faults with a translation fault — the "canonical address" check.
  • TBI (Top Byte Ignore) — allows bits [63:56] to carry a tag (used by MTE and HWASAN) without faulting.
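
With TBI enabled, software still has to strip the tag before comparing or doing arithmetic on pointers — a one-line sketch:

// clear a TBI/MTE tag in bits [63:56] (sketch)
and   x1, x0, #0x00FFFFFFFFFFFFFF   // x1 = untagged pointer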
05

The Page-Table Walker

  • On a TLB miss the Hardware Table Walker (HTW) walks the in-memory page tables without any software help.
  • Each descriptor is 8 bytes with a type (block / table / invalid) and PA + attribute bits.
  • The walk descends level by level: L0 → L1 → L2 → L3. Each step issues a memory read to the PA of the next-level table.
  • A walk can itself miss in the cache and go to DRAM — expensive. Modern cores prefetch walks into a dedicated walker cache (sometimes called an "intermediate TLB" or "paging-structure cache").
  • On Stage-1+Stage-2, each Stage-1 walk-step is itself Stage-2-translated — worst case 24 reads per VA translation.
// Simplified 4 KB, 48-bit VA walk (4 levels)

VA[47:39]  = L0 index (PGD)
VA[38:30]  = L1 index (PUD)
VA[29:21]  = L2 index (PMD)
VA[20:12]  = L3 index (PTE)
VA[11:0]   = page offset

PTE descriptor layout (simplified):
  [0]     valid
  [1]     table/page (1) or block (0) — at L3 this bit must be 1
  [11:2]  lower attrs (AP, SH, AttrIndx, AF, nG)
  [47:12] output address (PA bits)
  [63:50] upper attrs (GP, DBM, Contig, PXN, UXN, PBHA)

// AF = Access Flag. Software-set via a fault on older cores;
// Armv8.1 (FEAT_HAFDBS) adds hardware AF/dirty update.
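
To make the layout concrete, a hedged sketch of decoding an L3 page descriptor (registers and the fault label are illustrative):

// decode the L3 descriptor in x0 (sketch)
tbz   x0, #0, not_mapped            // bit[0]=0 → translation fault
and   x1, x0, #0x0000FFFFFFFFF000   // output address, PA[47:12]
ubfx  x2, x0, #2, #3                // AttrIndx → MAIR_EL1 slot
ubfx  x3, x0, #6, #2                // AP[2:1] access permissions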
06

ASID & VMID — Tagged TLB Entries

  • ASID (Address Space ID) — 8- or 16-bit tag in every TLB entry for Stage-1 user translations. Comes from TTBR0_EL1[63:48].
  • Lets the CPU switch user page tables without flushing the TLB — entries for the old process stay, tagged with its ASID.
  • Kernel entries are global (nG=0) — not tagged by ASID, so they're visible from every process without duplication.
  • VMID (VM ID) — similar concept at Stage-2, tags hypervisor guest translations. 8- or 16-bit. Set in VTTBR_EL2[63:48].
  • Linux rolls over its ASID allocator when all ASIDs are in use; rollover triggers broadcast invalidation (TLBI ASIDE1IS) for reassigned ASIDs.

Why tags matter

Without tags, every context switch would flush the L1 TLB (~48-128 entries), and refilling it costs hundreds of cycles. ASIDs make a context switch near-free for the TLB.

// TLBI maintenance ops (IS suffix = Inner Shareable broadcast)

tlbi  vmalle1is          // all EL1 stage-1 entries, every ASID
tlbi  aside1is, x0       // all entries matching the ASID in x0
tlbi  vae1is, x1         // single VA + ASID (encoded in x1)
tlbi  vmalls12e1is       // all Stage-1+2 entries for the current VMID
dsb   ish                // wait for completion across the domain
isb                      // resynchronise instruction fetch
07

Memory Types & Attributes

Type          | Cacheable         | Reorder?                       | Speculation? | Use
Normal        | Yes (with policy) | Yes (weak)                     | Yes          | DRAM, cacheable regions
Device-GRE    | No                | Gathered, reordered, early-ack | No           | High-throughput I/O (graphics buffers)
Device-nGRE   | No                | Reordered, early-ack           | No           | Less permissive I/O
Device-nGnRE  | No                | Early-ack only                 | No           | Typical peripherals (GIC, UART)
Device-nGnRnE | No                | None                           | No           | Strictly ordered — legacy "strongly-ordered"

"nG" = non-Gathering, "nR" = non-Reordering, "nE" = no Early write-ack. "Normal" memory also has an Inner & Outer cache policy (WB-WA, WT, NC). These are all encoded in an 8-bit MAIR_ELx slot, and the PTE picks a slot by 3-bit AttrIndx.

MAIR indirection means the attribute table can be rewritten once and all page-table entries immediately pick up the new meaning — very useful for quickly switching between write-through and write-back without touching every PTE.
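
A minimal bring-up sketch with just two slots (0xFF = Normal WB-WA, 0x00 = Device-nGnRnE, per the Arm ARM encodings):

// program MAIR_EL1 slot 0 and slot 1 (sketch)
movz  x0, #0x00FF          // byte 0 → slot 0 = Normal WB-WA; byte 1 → slot 1 = Device-nGnRnE
msr   mair_el1, x0
isb                        // subsequent translations see the new attribute table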

08

Cache Hierarchy in a Modern Cortex-A SoC

Modern Cortex-A cache hierarchy (top to bottom): per-core L1-I / L1-D (32-64 KB) → per-core L2 (256 KB – 3 MB) → L3 in the DynamIQ Shared Unit (DSU) → SLC, the System Level Cache on the CMN mesh → DRAM (LPDDR5X, DDR5, HBM).
  • L1-I / L1-D — split Harvard, per-core. Usually 32-64 KB, 2-4-way. VIPT (virtually-indexed, physically-tagged) — avoids synonym problems.
  • L2 — per core in modern cores (A76 onwards). 256 KB-3 MB. Inclusive or mostly-inclusive of L1.
  • L3 (DSU) — shared by the whole DynamIQ cluster. Up to 16 MB. Managed by the DSU.
  • SLC (System Level Cache) — attached to the CMN mesh/NoC. Sits on the path from cluster to DRAM. Shared with GPU, NPU, display.
  • Line size is 64 bytes across the whole A-profile family.
09

PoU vs PoC vs PoP

  • Point of Unification (PoU) — the level where I-cache and D-cache see the same data. Usually L2 or L3. Required for self-modifying code: after writing to memory, clean D-cache to PoU + invalidate I-cache to PoU.
  • Point of Coherency (PoC) — the level where all coherent masters see the same data. Usually the SLC or just before DRAM. Required for DMA & non-coherent masters.
  • Point of Persistence (PoP) — Armv8.2 addition for non-volatile memory (NVDIMM, battery-backed DRAM). Not widely used on mobile.
// JIT compiler flushes after writing code
// (write-then-execute pattern)

bl    generate_code        // builds code into [x0 .. x0+len]
add   x3, x0, x1           // assume generate_code returns len in x1 → x3 = end
mov   x2, x0               // x2 = cursor

1:  dc    cvau, x2         // clean D-line to PoU
    add   x2, x2, #64
    cmp   x2, x3
    b.lo  1b
dsb   ish                  // cleans visible to the I-side

mov   x2, x0
2:  ic    ivau, x2         // invalidate I-line to PoU (by VA, whole range)
    add   x2, x2, #64
    cmp   x2, x3
    b.lo  2b
dsb   ish                  // invalidates complete
isb                        // restart fetch

br    x0                   // now safe to execute

This sequence is what every JIT (V8, JSC, JVM) does on AArch64. Newer cores can shortcut it via CTR_EL0.DIC/IDC — if set, the D-clean or I-invalidate steps are redundant.
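
A sketch of that runtime check (IDC is CTR_EL0 bit 28, DIC bit 29; labels are illustrative):

// skip redundant maintenance when the core allows it (sketch)
mrs   x4, ctr_el0
tbnz  x4, #28, skip_dc     // IDC=1: no D-clean to PoU required
tbnz  x4, #29, skip_ic     // DIC=1: no I-invalidate to PoU required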

10

Cache Maintenance Instructions

Instruction  | What it does                         | Typical use
IC IALLU     | Invalidate entire I-cache (local PE) | After writing code pages
IC IVAU, Xt  | Invalidate I-cache by VA to PoU      | JIT flush (see prev slide)
DC IVAC, Xt  | Invalidate D-cache by VA to PoC      | Before DMA from device → memory
DC CVAC, Xt  | Clean D-cache by VA to PoC           | Before DMA from memory → device
DC CIVAC, Xt | Clean & invalidate by VA to PoC      | Bidirectional DMA buffers
DC CVAU, Xt  | Clean by VA to PoU                   | JIT flush, pairs with IC IVAU
DC ZVA, Xt   | Zero a cache line without a fill     | memset fast path (replaces an STR #0 loop)
DC CVAP, Xt  | Clean by VA to PoP                   | NVDIMM flush (Armv8.2)

Set/way operations (DC ISW / CSW / CISW) still exist for bring-up but are discouraged — they are not broadcast to other PEs. For runtime flushing, always use by-VA ops.
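
For example, a by-VA sketch for a bidirectional DMA buffer (bounds and the 64-byte line size are assumed):

// clean & invalidate [x0 .. x1) to PoC around bidirectional DMA (sketch)
1:  dc    civac, x0        // clean & invalidate one line to PoC
    add   x0, x0, #64
    cmp   x0, x1
    b.lo  1b
dsb   sy                   // complete before signalling the device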

11

Weak Memory Ordering — the Arm Model

  • Arm A-profile is weakly ordered. The only ordering guaranteed is:
    • Same-address load→load (coherence)
    • Same-address store→load (program order is respected for a single location)
    • Dependency ordering (address/data dep preserves program order for the dependent access)
  • Everything else — load→store, store→store, unrelated load→load — can be reordered by hardware.
  • Programmer recovers ordering with:
    • BarriersDMB / DSB / ISB
    • Release/acquireLDAR / STLR
    • LSE atomics (Armv8.1) with acquire/release semantics

"Load buffering" litmus test

Two cores: P0 does LDR w1, [A]; STR w2, [B]. P1 does LDR w3, [B]; STR w4, [A] (both stores write 1). Can both loads return 1, each observing the other core's store? On x86: no. On Arm: yes — loads can be reordered with later stores. See the sketch below.
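
Written out (registers illustrative; A and B start at zero):

// load-buffering (LB) litmus, sketch
// P0:                         P1:
//   ldr  w1, [A]                ldr  w3, [B]
//   str  w2, [B]   (w2=1)       str  w4, [A]   (w4=1)
// Outcome w1==1 && w3==1: allowed on Arm, forbidden on x86 (TSO).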

Use release/acquire, not barriers

Idiomatic AArch64 code uses LDAR/STLR — the acquire/release semantics pair with the CPU's own store buffer, no separate fence needed. Barriers become rare in modern code.

12

DMB / DSB / ISB — The Barrier Zoo

Barrier   | Scope              | Semantics                                        | Typical use
DMB ISH   | Inner Shareable    | Orders all loads/stores across the ISH domain    | Lock fence, spinlock release
DMB ISHLD | ISH, load barrier  | Orders earlier loads before later loads & stores | Read-side of RCU
DMB ISHST | ISH, store barrier | Orders earlier stores before later stores        | Write-combining flush
DSB ISH   | Inner Shareable    | All prior accesses & maintenance ops complete    | After TLBI / IC / DC
DSB OSH   | Outer Shareable    | Also covers devices in the outer domain          | DMA start synchronisation
DSB SY    | Full system        | Full system barrier                              | Boot, rare
ISB       | Local PE           | Context synchronisation — flushes the pipeline   | After system-register writes, after cache maintenance

Rule of thumb: DMB orders memory accesses; DSB also waits for them (and for maintenance ops) to complete; ISB flushes the pipeline so subsequent fetches see new state. For plain data ordering, most barrier uses can be replaced by LDAR/STLR/CAS — maintenance sequences still need DSB/ISB.

13

LSE Atomics (Armv8.1-A)

  • Before LSE, atomic ops were built from an LDXR / STXR loop — "Load-Linked / Store-Conditional" (see the sketch at the end of this slide). Under contention it can livelock.
  • LSE — Large System Extension — adds single-instruction atomics that scale to many-core.
  • Each takes acquire (A) / release (L) / acquire+release (AL) / plain variants — so ordering is baked in.
  • Under the hood, big SMP systems implement LSE as a far atomic sent to the Home Node (HN-F) in CHI — the cache-line's "owner" executes the RMW without the requester needing to own the line.
  • The Linux kernel patches LSE in at boot when present; for user code, -moutline-atomics emits a runtime picker.
// LSE atomics — all AArch64-only, Armv8.1-A+

cas    w0, w1, [x2]        // compare-and-swap
casal  w0, w1, [x2]        // CAS acquire+release
swp    w0, w1, [x2]        // SWP: atomic exchange

ldadd  w0, w1, [x2]        // atomic fetch-add
ldaxr / ldxr              // legacy exclusive loads (LL/SC — sketch below)

// Idiomatic acquire-semantics spinlock using CAS
acquire:
    mov   w3, #1
1:  mov   w1, wzr
    casa  w1, w3, [x0]     // CAS acquire: expect 0, set 1
    cbnz  w1, 1b           // retry until free
    ret

release:
    stlr  wzr, [x0]        // release-store clears
    ret
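
For contrast, the pre-LSE LDXR/STXR loop for the same fetch-add — the contended-livelock shape mentioned above (sketch):

// atomic fetch-add via exclusives, pre-Armv8.1 (sketch)
1:  ldxr  w1, [x2]         // load-exclusive current value
    add   w3, w1, w0       // compute new value
    stxr  w4, w3, [x2]     // store-exclusive; w4 = 0 on success
    cbnz  w4, 1b           // exclusive lost — retry (can livelock under contention)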
14

Producer / Consumer — Release & Acquire in Practice

// Classic "publish data, then flag" pattern

// Producer (core A):
//   data is filled in, then published
str   x0, [data_ptr]       // 1: write data
stlr  w1, [flag]           // 2: release-store flag=1

// Consumer (core B):
//   spin on flag, then read data
1:  ldar  w2, [flag]       // acquire-load
    cbz   w2, 1b           // loop until non-zero
ldr   x3, [data_ptr]       // safe: sees the pre-STLR write
  • The STLR in step 2 has release semantics: no prior store by this core may be reordered past it.
  • The LDAR in step 1 of the consumer has acquire semantics: no subsequent load may be reordered before it.
  • Together, the STLR-LDAR pair creates a happens-before edge — the consumer is guaranteed to see the producer's earlier STR x0, [data_ptr].
  • Without release/acquire this program is racy on Arm — the consumer can read stale data. On x86 it happens to work because x86 is TSO.
  • C++ memory_order_release / memory_order_acquire map exactly to STLR / LDAR on AArch64.
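
For contrast, the same pattern with explicit barriers — correct, but exactly what release/acquire makes unnecessary (sketch):

// barrier-based producer/consumer (sketch)
// Producer:
str   x0, [data_ptr]       // write data
dmb   ish                  // order data before flag
str   w1, [flag]           // plain flag store

// Consumer:
1:  ldr   w2, [flag]
    cbz   w2, 1b
dmb   ish                  // order flag read before data read
ldr   x3, [data_ptr]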
15

Prefetch & PRFM

  • PRFM — hint to the caches; no architectural effect.
  • Encoded target × type × locality:
    • Target: PLD (load) / PLI (instruction) / PST (store)
    • Type: L1 / L2 / L3
    • Locality: KEEP / STRM (streaming — hint cache it briefly)
  • Hardware prefetchers also run in parallel — strided, stream, and increasingly Data-Dependent prefetchers in Cortex-X4+.
  • In some modern workloads (DB hashing, graph traversal) explicit PRFM can still win by 10-20% even with a good HW prefetcher.
// Linked-list walk with manual prefetch
// Classic cache-miss-hiding pattern

walk:
    stp   x29, x30, [sp, #-32]!   // save FP/LR
    str   x19, [sp, #16]          // save callee-saved scratch
1:  cbz   x0, 2f                  // end of list?
    ldr   x19, [x0, #NEXT]        // link to the next node
    prfm  pldl1keep, [x19]        // prefetch next node (a NULL prefetch is a harmless hint)
    bl    process_node            // process node in x0; x19 survives the call
    mov   x0, x19                 // advance
    b     1b
2:  ldr   x19, [sp, #16]
    ldp   x29, x30, [sp], #32
    ret

DC ZVA is effectively a "prefetch-for-write" — it zeroes a cache line in L1 without a fill from DRAM. Used in memset / calloc fast-paths.
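
A sketch of the zeroing loop (assumes a 64-byte ZVA block; real code reads DCZID_EL0 and checks its DZP bit first):

// zero [x0 .. x1) with DC ZVA (sketch)
1:  dc    zva, x0          // zero a full line in cache, no DRAM fill
    add   x0, x0, #64
    cmp   x0, x1
    b.lo  1b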

16

Key takeaways

  • Weak memory model is the single most-tripped-up topic when reasoning about Arm SMP — know STLR / LDAR / LSE.
  • PoU vs PoC is the JIT & DMA test. Memorise the "clean to PoU, invalidate I to PoU" sequence.
  • Know what ASID does and why nG=0 kernel entries avoid ASID tagging.
  • TTBR0 vs TTBR1 split lets Linux keep a single kernel page-table across all processes.
17

Lessons

  • "Why tagged TLB with ASID?" → avoids flushing user entries on context switch; cut switch cost from hundreds of cycles to near zero.
  • "What's the difference between PoU and PoC?" → PoU is where I and D caches see the same data (for self-modifying code); PoC is where everyone, including DMA masters, sees the same data.
  • "Why does Linux use 4 KB granule mostly?" → x86 compat, small-page granularity for fine-grained mmap. Apple chose 16 KB for better TLB coverage and fewer page-table levels.
  • "Why LSE atomics?" → LDXR/STXR livelocks under heavy contention; LSE sends a single RMW to the cache-line's home node, scaling to 64+ cores.
  • "Explain a spinlock on AArch64" → CAS acquire to take, STLR to release, WFE while parked; avoids contended LL/SC livelock.
  • "When is a DSB needed?" → after TLBI or IC/DC maintenance, to ensure it's globally observable before the ISB; rarely needed for plain data ordering (use release/acquire).
  • "What is TBI?" → Top Byte Ignore — CPU ignores bits [63:56] of a VA on access. Enables MTE tags and HWASAN.
18

References

Arm Ltd. — DDI 0487, Arm Architecture Reference Manual (VMSA, ordering, and cache chapters)
Arm Ltd. — Learn the architecture: AArch64 memory management (free, well-written walkthrough)
Arm Ltd. — Cache maintenance application note (by-VA vs set/way operations)
McKenney, Paul E. — Is Parallel Programming Hard, And, If So, What Can You Do About It? (RCU author; covers the Arm memory model)
Alglave, Maranget, Tautschnig — "Herding Cats: Modelling, Simulation, Testing, and Data Mining for Weak Memory" (ACM TOPLAS 2014) — Arm memory-model formalisation
Linux kernel — Documentation/arm64/memory.rst (layout & ASID notes)
Linux kernel — arch/arm64/include/asm/atomic_lse.h (LSE atomic implementations)
Preshing, Jeff — preshing.com (memory-ordering intuition pieces)

Presentation built with Reveal.js 4.6 · Playfair Display + DM Sans + JetBrains Mono
Educational use.