Armv8-A's default ordering is weak, but the ISA provides release-consistent (RCsc) acquire/release instructions (LDAR/STLR) and, from v8.3, weaker RCpc (processor-consistent) acquire loads (LDAPR). Java volatile, C++ seq_cst, and Linux's rcu_assign_pointer/smp_store_release all compile down to these primitives.
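The two acquire flavours side by side (illustrative snippet; LDAPR requires FEAT_LRCPC):
ldar w1, [x0] // RCsc acquire: also ordered after any earlier STLR
ldapr w1, [x0] // RCpc acquire (v8.3): may overtake an earlier STLR to a
// different address, which is why compilers use it for C++ acquire loads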
Dual-stage translation is how KVM and Android pKVM give guests their own physical address spaces cheaply — the guest-physical (IPA) → host-physical mapping is Stage 2, managed by the hypervisor and invisible to the guest OS.
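A minimal sketch of the hypervisor side (hypothetical register contents; assumes 16-bit VMIDs via FEAT_VMID16):
// x0 = Stage-2 table base (PA), x1 = this guest's VMID
orr x0, x0, x1, lsl #48 // VMID lives in VTTBR_EL2[63:48]
msr vttbr_el2, x0
isb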
| Granule | Typical use | Levels for 48-bit VA | Block sizes available |
|---|---|---|---|
| 4 KB | Linux default; most servers/phones | 4 (L0→L3) | 4 KB page, 2 MB block, 1 GB block |
| 16 KB | Apple iOS/macOS default | 4 (L0→L3) | 16 KB page, 32 MB block |
| 64 KB | Some server workloads; huge-page-heavy use | 3 (L1→L3) | 64 KB page, 512 MB block |
With the 4 KB granule, each translation-table level consumes 9 bits of VA — levels 0..3 index with VA[47:39], [38:30], [29:21], [20:12]. The 16 KB granule uses 11-bit indexing per level; 64 KB uses 13-bit.
Instead of walking all the way down to a 4 KB page, an L1 descriptor can directly map a 1 GB block. Linux uses this for hugetlb and for the kernel's linear (direct) map.
LVA (Large Virtual Address) extends VA to 52 bits with the 64 KB granule; LPA2 brings 52-bit VAs to the 4 KB and 16 KB granules, adding a fifth walk level (level -1) for 4 KB. LPA extends PA to 52 bits, carrying the extra output-address bits in the descriptor.
// Simplified 4 KB, 48-bit VA walk (4 levels)
VA[47:39] = L0 index (PGD)
VA[38:30] = L1 index (PUD)
VA[29:21] = L2 index (PMD)
VA[20:12] = L3 index (PTE)
VA[11:0] = page offset
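The same split, written as the bitfield extracts a software walker might use (illustrative; register choices arbitrary):
ubfx x1, x0, #39, #9 // L0 index = VA[47:39]
ubfx x2, x0, #30, #9 // L1 index = VA[38:30]
ubfx x3, x0, #21, #9 // L2 index = VA[29:21]
ubfx x4, x0, #12, #9 // L3 index = VA[20:12]
and x5, x0, #0xfff // page offset = VA[11:0]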
PTE descriptor layout (simplified):
[0] valid
[1] table (1) or block (0) at L0–L2; at L3, 1 = page
[11:2] lower attrs (AttrIndx, AP, SH, AF, nG)
[47:12] output address (PA bits)
[63:50] upper attrs (PBHA, GP, DBM, Contig, PXN, UXN)
// AF = Access Flag: software-managed on older cores
// (fault on first access); Armv8.1 FEAT_HAFDBS adds HW update.
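Putting the fields together: an illustrative L2 block descriptor for a 2 MB Normal mapping (assumes MAIR slot 1 holds a Normal write-back attribute):
// x0 = 2 MB-aligned output PA
mov x1, #0x1 // bits[1:0] = 01: valid block descriptor
orr x1, x1, #(1 << 10) // AF = 1: no Access-Flag fault on first use
orr x1, x1, #(3 << 8) // SH = Inner Shareable
orr x1, x1, #(1 << 2) // AttrIndx = 1
orr x1, x1, x0 // merge in the output address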
Without ASIDs, every context switch would have to flush the TLB (~48–128 L1 entries); refilling them costs hundreds of cycles of page-table walks. Tagging entries with an ASID makes a context switch near-free for the TLB.
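What that buys: a context switch becomes a single TTBR0_EL1 rewrite with no TLBI (sketch; assumes TCR_EL1.A1 = 0 and 16-bit ASIDs):
// x0 = new table base (PA), x1 = incoming process's ASID
orr x0, x0, x1, lsl #48 // ASID lives in TTBR0_EL1[63:48]
msr ttbr0_el1, x0
isb // subsequent fetches translate under the new ASID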
// TLBI maintenance ops (broadcast Inner Shareable: the "is" suffix)
tlbi vmalle1is // all stage-1 entries for EL1&0, current VMID
tlbi aside1is, x0 // all entries matching the ASID in x0
tlbi vae1is, x1 // single VA (VA + ASID encoded in x1)
tlbi vmalls12e1is // all Stage-1+2 entries for this VMID
dsb ish // wait for the invalidates to complete
isb // then resynchronise the pipeline
| Type | Cacheable | Relaxations allowed | Speculation? | Use |
|---|---|---|---|---|
| Normal | Yes (with policy) | Full weak reordering | Yes | DRAM, cacheable regions |
| Device-GRE | No | Gathering, reordering, early write-ack | No | High-throughput I/O (graphics buffers) |
| Device-nGRE | No | Reordering, early write-ack | No | Less permissive I/O |
| Device-nGnRE | No | Early write-ack only | No | Typical peripherals (GIC, UART) |
| Device-nGnRnE | No | None | No | Strictly ordered — legacy "strongly-ordered" |
"nG" = non-Gathering, "nR" = non-Reordering, "nE" = no Early write-ack. "Normal" memory also has an Inner & Outer cache policy (WB-WA, WT, NC). These are all encoded in an 8-bit MAIR_ELx slot, and the PTE picks a slot by 3-bit AttrIndx.
MAIR indirection means the attribute table can be rewritten once and every page-table entry picks up the new meaning — useful for switching, say, write-through to write-back without touching every PTE (a TLB invalidation is still needed so cached translations see the new attributes).
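A minimal MAIR_EL1 setup sketch (slot assignment arbitrary; attribute encodings from the Arm ARM):
// Attr0 = 0x00: Device-nGnRnE; Attr1 = 0xFF: Normal, Inner+Outer WB-WA
movz x0, #0xff00
msr mair_el1, x0
isb
// A PTE with AttrIndx = 1 now maps as Normal write-back memory.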
// JIT compiler flushes after writing code
// (write-then-execute pattern; assumes 64-byte lines;
// real code reads the line size from CTR_EL0)
bl generate_code // builds into [x0 .. x0+len]; len assumed in x1
add x3, x0, x1 // end of the generated region
mov x2, x0 // save base
1: dc cvau, x2 // clean D to PoU
add x2, x2, #64
cmp x2, x3
b.lo 1b
dsb ish // wait for cleans
mov x2, x0
2: ic ivau, x2 // invalidate I to PoU (by VA), whole range
add x2, x2, #64
cmp x2, x3
b.lo 2b
dsb ish // wait for the invalidates
isb // restart fetch
br x0 // now safe to execute
This sequence is what every JIT (V8, JavaScriptCore, the JVM) does on AArch64. Newer cores can shortcut it via CTR_EL0: if IDC is set the DC CVAU loop is unnecessary, and if DIC is set the IC IVAU loop is too.
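The check itself is two bit tests (sketch; branch targets hypothetical):
mrs x4, ctr_el0
tbnz x4, #28, skip_dc // IDC = 1: DC CVAU loop not required
tbnz x4, #29, skip_ic // DIC = 1: IC IVAU loop not required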
| Instruction | What it does | Typical use |
|---|---|---|
| IC IALLU | Invalidate all I-cache (local PE) | After writing code pages |
| IC IVAU, Xt | Invalidate I-cache by VA to PoU | JIT flush (see prev slide) |
| DC IVAC, Xt | Invalidate D-cache by VA to PoC | Device → memory DMA (discard stale lines) |
| DC CVAC, Xt | Clean D-cache by VA to PoC | Memory → device DMA |
| DC CIVAC, Xt | Clean & invalidate by VA to PoC | Bidirectional DMA buffers |
| DC CVAU, Xt | Clean by VA to PoU | JIT flush, paired with IC IVAU |
| DC ZVA, Xt | Zero a cache-line-sized block without fetching from DRAM | memset fast path (replaces a STR #0 loop) |
| DC CVAP, Xt | Clean by VA to Point of Persistence | NVDIMM flush (Armv8.2) |
Set/way operations (DC ISW / CSW / CISW) still exist for bring-up but are discouraged — they are not broadcast to other PEs. For runtime flushing, always use by-VA ops.
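For example, a by-VA clean over an outbound DMA buffer (sketch; assumes 64-byte lines, buffer base/length in x0/x1):
mov x2, x0 // cursor
add x3, x0, x1 // end
1: dc cvac, x2 // clean each line to PoC
add x2, x2, #64
cmp x2, x3
b.lo 1b
dsb osh // complete before telling the device to start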
Two cores: P0 does LDR x1, [A]; STR x2, [B]; P1 does LDR x3, [B]; STR x4, [A] — the classic LB litmus test. Can both loads observe the other core's store? On x86: no. On Arm: yes — a load can be reordered with a later store.
Idiomatic AArch64 code uses LDAR/STLR — the acquire/release semantics pair with the CPU's own store buffer, no separate fence needed. Barriers become rare in modern code.
| Barrier | Scope | Semantics | Typical use |
|---|---|---|---|
| DMB ISH | Inner Shareable | All PEs in the domain observe prior loads/stores in order | Lock fence, spinlock release |
| DMB ISHLD | ISH, loads only | Orders prior loads against later accesses | Read-side of RCU |
| DMB ISHST | ISH, stores only | Orders prior stores against later stores | Write-combining flush |
| DSB ISH | Inner Shareable | Waits for prior accesses and maintenance ops to complete | After TLBI / IC / DC |
| DSB OSH | Outer Shareable | Also covers devices in the outer domain | DMA start synchronisation |
| DSB SY | Full system | Full system barrier | Boot, rare |
| ISB | Local PE | Instruction Synchronisation — context-synchronising, flushes the pipeline | After system-register writes, after cache maintenance |
Rule of thumb: DMB orders memory accesses; DSB additionally waits for them (and for maintenance ops) to complete; ISB flushes the pipeline so subsequent fetches see new state. In ordinary data synchronisation, nearly every explicit barrier can be replaced by LDAR/STLR/CAS; maintenance sequences (TLBI, IC, DC) still need DSB/ISB.
// LSE atomics — all AArch64-only, Armv8.1-A+
cas w0, w1, [x2] // compare-and-swap
casal w0, w1, [x2] // CAS acquire+release
swp w0, w1, [x2] // SWP: atomic exchange
ldadd w0, w1, [x2] // atomic fetch-add
ldaxr / stlxr // legacy LL/SC exclusives (pre-LSE)
// Idiomatic acquire-semantics spinlock using CAS
acquire:
mov w3, #1
1: mov w1, wzr
casa w1, w3, [x0] // CAS acquire: expect 0, set 1
cbnz w1, 1b // retry until free
ret
release:
stlr wzr, [x0] // release-store clears
ret
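A contention-friendlier variant (illustrative) spins with plain loads and only attempts the CAS once the lock looks free:
acquire_poll:
mov w3, #1
1: ldr w1, [x0] // read-only spin: the line stays shared
cbnz w1, 1b
mov w1, wzr
casa w1, w3, [x0] // try to take it; w1 returns the observed value
cbnz w1, 1b // lost the race, back to spinning
ret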
// Classic "publish data, then flag" pattern
// (x5 = &data, x6 = &flag; register choices illustrative)
// Producer (core A): data is filled in, then published
str x0, [x5] // 1: write data
stlr w1, [x6] // 2: release-store flag = 1
// Consumer (core B): spin on flag, then read data
1: ldar w2, [x6] // acquire-load of the flag
cbz w2, 1b // loop until non-zero
ldr x3, [x5] // safe: guaranteed to see the pre-STLR write
// Linked-list walk with manual prefetch
// Classic cache-miss-hiding pattern (node ptr in x0; NEXT = next-field offset)
walk:
stp x19, x30, [sp, #-16]! // keep next ptr in a callee-saved reg across calls
1: cbz x0, 2f
ldr x19, [x0, #NEXT] // load the next pointer early...
prfm pldl1keep, [x19] // ...and prefetch the next node (prefetches never fault)
bl process_node // process current node (pointer in x0)
mov x0, x19
b 1b
2: ldp x19, x30, [sp], #16
ret
DC ZVA is effectively a "prefetch-for-write" — it zeroes a cache line in L1 without a fill from DRAM. Used in memset / calloc fast-paths.
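An illustrative zeroing loop (assumes a 64-byte ZVA block; real code reads the block size from DCZID_EL0 and checks its DZP bit first):
// x0 = 64-byte-aligned buffer, x1 = length in bytes
add x2, x0, x1 // end
1: dc zva, x0 // zero one block, no fill from DRAM
add x0, x0, #64
cmp x0, x2
b.lo 1b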
Arm Ltd. — DDI 0487 — VMSA, ordering, cache chapters
Arm Ltd. — Learn the architecture: AArch64 memory management — free, well-written walkthrough
Arm Ltd. — Cache maintenance application note — by-VA vs set/way operations
McKenney, Paul E. — Is Parallel Programming Hard, And, If So, What Can You Do About It? — RCU author, covers Arm memory model
Alglave, Maranget, Tautschnig — "Herding cats: Modelling, simulation, testing, and data-mining for weak memory" (ACM TOPLAS 2014) — Arm memory model formalisation
Linux kernel — Documentation/arm64/memory.rst — layout & ASID notes
Linux kernel — arch/arm64/include/asm/atomic_lse.h — LSE atomic implementations
Preshing, Jeff — preshing.com — excellent memory-ordering intuition pieces
Presentation built with Reveal.js 4.6 · Playfair Display + DM Sans + JetBrains Mono
Educational use.