ARM CORTEX-M · PRESENTATION 04

Memory System & MPU

Memory Map · Bit-Banding · Barriers · MPU · Caches · TCM

No MMU. No virtual memory. Just a carefully-partitioned 4 GiB of physical address space.

Why a Fixed Memory Map?

Every Cortex-M sees the same 4 GiB map. The architecture fixes eight regions, each with default attributes and semantics.
A vendor SoC maps its flash, SRAM, peripherals and external memories inside these regions — but the region boundaries and defaults are architectural.
The compiler / linker can therefore assume memory semantics (e.g. that 0x4000_0000 is Device memory) without device-specific macros.
The NVIC, SCB, MPU, FPU, SysTick, and CoreSight components live at fixed addresses inside the Private Peripheral Bus (PPB).

Consequence

If you port an RTOS from STM32 to Nordic to NXP, the address of SCB->VTOR is identical. The peripheral offsets change; the core does not.

Consequence

Attributes like "Normal" vs "Device" memory are typed by the region (default), but the MPU can override them on a region-by-region basis.

The Eight Regions

Address range	Region	Contents	Default attributes
0xE010_0000 – 0xFFFF_FFFF	Vendor (reserved)	-	-
0xE000_0000 – 0xE00F_FFFF	Private Peripheral Bus (PPB)	NVIC · SCB · SysTick · FPU · MPU · CoreSight	Strongly-Ordered
0xA000_0000 – 0xDFFF_FFFF	External Device	Off-chip devices: eMMC controllers etc.	Device memory
0x6000_0000 – 0x9FFF_FFFF	External RAM	Off-chip SDRAM / HyperRAM / QSPI-XIP	Normal memory, cacheable
0x4000_0000 – 0x5FFF_FFFF	Peripheral	On-chip peripherals, DMAs, GPIO, AHB/APB blocks	Device memory
0x2000_0000 – 0x3FFF_FFFF	SRAM	On-chip SRAM; bit-band alias at 0x2200_0000	Normal memory
0x0000_0000 – 0x1FFF_FFFF	Code	Flash / ROM; vector table at reset	Normal memory, executable

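The map is carved from eight 512 MiB blocks: External RAM and External Device take two blocks (1 GiB) each, and the top block holds the 1 MiB PPB plus vendor space. The PPB is Strongly-Ordered: every access completes in program order, with no speculative fetch and no write buffering.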
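Because the boundaries are architectural, a tool or debug helper can classify any address without knowing the vendor. The following is a minimal sketch of that idea; the enum and function names are illustrative and do not come from CMSIS.

```c
#include <stdint.h>

/* Architectural regions of the Cortex-M default memory map. */
typedef enum {
    REGION_CODE,
    REGION_SRAM,
    REGION_PERIPHERAL,
    REGION_EXTERNAL_RAM,
    REGION_EXTERNAL_DEVICE,
    REGION_PPB,
    REGION_VENDOR
} mem_region_t;

/* Classify a physical address using the fixed region boundaries. */
static mem_region_t region_of(uint32_t addr)
{
    if (addr < 0x20000000u) return REGION_CODE;
    if (addr < 0x40000000u) return REGION_SRAM;
    if (addr < 0x60000000u) return REGION_PERIPHERAL;
    if (addr < 0xA0000000u) return REGION_EXTERNAL_RAM;
    if (addr < 0xE0000000u) return REGION_EXTERNAL_DEVICE;
    if (addr < 0xE0100000u) return REGION_PPB;
    return REGION_VENDOR;
}
```

For example, an STM32-style flash address such as 0x0800_0000 lands in the Code region, and SCB->VTOR lands in the PPB.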

Code Region — 0x0000_0000

Vector table must be reachable at address 0 after reset.
Most SoCs "alias" flash to 0x0000_0000 (e.g. STM32 BOOT0 pin selects flash vs system-memory vs SRAM).
Normal memory semantics — cacheable, bufferable, executable.
Instructions can also be fetched from the SRAM region, but performance on M3/M4 is lower because of the bus matrix design — M7 mitigates this via I-cache + AXI.

Layout inside Code region (vendor pattern)

0x0000_0000   (boot alias — remapped by OPTR)
0x0800_0000   main flash — STM32
0x1FFF_0000   system memory (DFU bootloader)
0x1FFF_7800   OTP / option bytes
0x1FFF_F800   engineering bytes

Exact map depends entirely on the vendor. The architecture only guarantees that, at reset, the core loads the initial MSP from 0x0000_0000 and the reset vector from 0x0000_0004.
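To make the reset behaviour concrete, here is a small host-side sketch that extracts those two words from a firmware image. The struct and function names are hypothetical; the only facts relied on are that the first word is the initial stack pointer, the second is the reset handler, and both are stored little-endian.

```c
#include <stdint.h>

typedef struct {
    uint32_t initial_sp;    /* word at offset 0x0 */
    uint32_t reset_handler; /* word at offset 0x4 (bit 0 set marks Thumb code) */
} reset_words_t;

/* Assemble a little-endian 32-bit word from four bytes. */
static uint32_t le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Read the two words the core fetches at reset from the start of an image. */
static reset_words_t read_reset_words(const uint8_t *image)
{
    reset_words_t r = { le32(image), le32(image + 4) };
    return r;
}
```

A quick sanity check on any image: the stack pointer should fall in the SRAM region and the reset handler in executable memory with bit 0 set.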

SRAM Region — 0x2000_0000

Canonical home for stack, heap, .data, .bss.
Normal memory: cacheable (on cores with caches), bufferable, and executable by default; the MPU can mark it XN.
First 1 MiB (0x2000_0000 – 0x200F_FFFF) is the bit-band region on M3/M4 — single-bit atomic access via the alias.
Parity / ECC on many SoCs: on M7, a speculative load can hit a soft error in a word your code never explicitly touched and still raise a fault.

Linker placement

MEMORY {
  FLASH (rx)  : ORIGIN = 0x08000000, LENGTH = 1M
  SRAM  (rwx) : ORIGIN = 0x20000000, LENGTH = 192K
  CCM   (rw)  : ORIGIN = 0x10000000, LENGTH = 64K
}
SECTIONS {
  .text  : { *(.isr_vector) *(.text*) *(.rodata*) } >FLASH
  .data  : { *(.data*) } >SRAM AT>FLASH
  .bss   : { *(.bss*)  } >SRAM
  .ccm   : { *(.ccm*)  } >CCM
}
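The `>SRAM AT>FLASH` placement means .data is stored in flash but runs from SRAM, so startup code has to copy it and clear .bss before main(). A minimal sketch of those two loops follows. On target they would be called from the reset handler with linker-provided symbols (names such as `_sidata`, `_sdata`, `_edata`, `_sbss`, `_ebss` are a common convention, but they are an assumption here since the script above does not define them).

```c
#include <stdint.h>

/* Copy the .data image from its load address (flash) to its run address (SRAM). */
static void copy_data(const uint32_t *load, uint32_t *start, const uint32_t *end)
{
    while (start < end) {
        *start++ = *load++;
    }
}

/* Clear .bss so zero-initialised variables really start at zero. */
static void zero_bss(uint32_t *start, const uint32_t *end)
{
    while (start < end) {
        *start++ = 0u;
    }
}
```

Both loops work word by word, which assumes the linker script aligns the section boundaries to 4 bytes.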

Peripheral Region — 0x4000_0000

Default attributes: Device memory — no speculation, no merging, ordering preserved between accesses to the same peripheral.
Execute-never by default (XN): even if MPU is off, CPU refuses to fetch instructions from this region.
First 1 MiB has a bit-band alias at 0x4200_0000 (M3/M4 only).
All the GPIO, timers, UART, SPI, I²C, DMA etc. on the chip live here — so does the on-chip ADC / USB / Ethernet MAC.

Device-memory semantics

Consecutive writes to a peripheral register are guaranteed to reach it in order, but may still be posted by the bus. Reads stall the core until the peripheral responds. No coalescing: two str instructions generate two bus transactions.

Hazard: "write-and-forget" semantics mean a str that raises a bus fault may not fault precisely — see imprecise bus fault in presentation 02.

PPB — Private Peripheral Bus

Address	Block	Role
0xE000_0000	ITM	Instrumentation Trace Macrocell (printf over SWO)
0xE000_1000	DWT	Data Watchpoint & Trace (cycle counter, watchpoints)
0xE000_2000	FPB	Flash Patch & Breakpoint
0xE000_E000	SCS	System Control Space — NVIC, SCB, MPU, SysTick, FPU
0xE004_0000	TPIU	Trace Port Interface Unit
0xE004_1000	ETM	Embedded Trace Macrocell (optional)
0xE00F_F000	ROM tables	CoreSight component enumeration

Strongly-Ordered

Every PPB access completes before the next begins. No speculation. No reordering. No write buffer. Good for configuring NVIC / MPU safely but means PPB writes are never fast.

Privileged only: unprivileged accesses to the PPB generate a BusFault, apart from a few exceptions that software must enable explicitly (ITM stimulus ports, NVIC STIR).
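The fixed PPB layout means the core register addresses can be written down once for every Cortex-M. The sketch below derives a few of them from the SCS base in the table; the macro names carry a `PPB_` prefix to avoid clashing with the CMSIS definitions, and the helper function is purely illustrative.

```c
#include <stdint.h>

#define PPB_SCS_BASE      0xE000E000u                 /* System Control Space */
#define PPB_SYSTICK_BASE  (PPB_SCS_BASE + 0x010u)
#define PPB_NVIC_BASE     (PPB_SCS_BASE + 0x100u)
#define PPB_SCB_BASE      (PPB_SCS_BASE + 0xD00u)
#define PPB_MPU_BASE      (PPB_SCS_BASE + 0xD90u)
#define PPB_SCB_VTOR      (PPB_SCB_BASE + 0x08u)      /* Vector Table Offset Register */

/* True if addr falls inside the 4 KiB System Control Space. */
static int is_scs_address(uint32_t addr)
{
    return addr >= PPB_SCS_BASE && addr < PPB_SCS_BASE + 0x1000u;
}
```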

Bit-Banding — Concept

On Cortex-M3 / M4, a bit-band alias region lets each bit of a 32-bit word in bit-band memory be addressed as its own 32-bit word in the alias.

Read from alias → returns 0 or 1 (padded to 32 bits).
Write to alias → atomically sets or clears the corresponding bit.
The bus translates the access — no RMW cycle in CPU, no IRQ masking needed.
Dropped from Cortex-M7 and all Armv8-M cores — use exclusive load/store or atomic bit-set/clear peripherals instead.

Addressing math

alias = alias_base + (byte_offset << 5)
                   + (bit_number  << 2)

/* SRAM band:   0x2000_0000 … 0x200F_FFFF (1 MiB)
   SRAM alias:  0x2200_0000 … 0x23FF_FFFF (32 MiB) */

#define BIT_BAND_SRAM(byte, bit)   \
    (*(volatile uint32_t *)        \
       (0x22000000u +               \
        ((uint32_t)(byte) - 0x20000000u) * 32 + \
        (bit) * 4))

Every bit in the 1 MiB band has exactly one 4-byte alias word: 8 Mi bits × 4 bytes = 32 MiB of alias space.
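The same arithmetic can be written as a plain function, which is easier to test on a host than the macro. This is a sketch only: the constant and function names are made up for illustration, and the assert simply rejects addresses outside the 1 MiB band.

```c
#include <assert.h>
#include <stdint.h>

#define BB_SRAM_BAND     0x20000000u
#define BB_SRAM_ALIAS    0x22000000u
#define BB_PERIPH_BAND   0x40000000u
#define BB_PERIPH_ALIAS  0x42000000u
#define BB_BAND_SIZE     0x00100000u   /* 1 MiB */

/* Alias word address for bit `bit` of the byte (or word) at `addr`. */
static uint32_t bitband_alias(uint32_t band, uint32_t alias,
                              uint32_t addr, unsigned bit)
{
    assert(addr >= band && addr - band < BB_BAND_SIZE && bit < 32u);
    return alias + ((addr - band) << 5) + ((uint32_t)bit << 2);
}
```

The last bit of the last byte in the SRAM band maps to the last word of the alias (0x23FF_FFFC), which confirms the 32 MiB alias size.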

Bit-Banding — Atomic RMW for Free

Without bit-banding

; set bit 5 of *p atomically
disable_irq:     cpsid i
                 ldr   r1, [r0]
                 orr   r1, r1, #0x20
                 str   r1, [r0]
enable_irq:      cpsie i

5 instructions + IRQ mask window.

With bit-banding

; set bit 5 of address p via alias
ldr  r0, =0x22000000 + ((p-0x20000000)*32) + (5*4)
movs r1, #1
str  r1, [r0]

One store (plus address setup). No IRQ mask. The bus performs the RMW atomically on the CPU's behalf.

Killer use-case (back in the day): setting or clearing a GPIO pin from an ISR, updating a flag shared between a timer ISR and mainline code, releasing a 1-bit semaphore, all truly atomic without touching PRIMASK. The feature was quietly dropped on M7+ because caches make the alias semantics hard.
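On cores without bit-banding, one portable replacement for flags held in SRAM is C11 atomics, which compilers for Armv7-M and Armv8-M Mainline typically lower to an exclusive load/store retry loop. The helper names below are illustrative. This sketch does not apply to peripheral registers, where a vendor's set/clear registers are usually the better tool.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Atomically set one bit; returns the word's previous value. */
static inline uint32_t atomic_set_bit(atomic_uint_least32_t *word, unsigned bit)
{
    return (uint32_t)atomic_fetch_or(word, (uint_least32_t)1u << bit);
}

/* Atomically clear one bit; returns the word's previous value. */
static inline uint32_t atomic_clear_bit(atomic_uint_least32_t *word, unsigned bit)
{
    return (uint32_t)atomic_fetch_and(word, ~((uint_least32_t)1u << bit));
}
```

Unlike a bit-band write, the returned previous value also gives a test-and-set primitive without extra code.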

Memory Types

Type	Speculative fetch	Merging writes	Ordering	Typical use
Normal	Yes	Yes (within a beat)	Weak — CPU may reorder	Code, SRAM, cached external RAM
Device	No	No (each access separate)	Preserved within same type	Peripheral registers
Strongly-Ordered	No	No	Preserved globally; each access completes before next	PPB / NVIC / MPU / CoreSight

Normal memory attributes

Cacheable — may live in I-cache / D-cache (M7/M55/M85).
Bufferable — writes may be posted in the write buffer.
Shareable — shared with other masters (DMA, second core).

Why "Device" exists

A peripheral register might have side-effects on read. Two reads must produce two bus cycles; the CPU cannot coalesce or speculate.

Memory Barriers

DMB

Data Memory Barrier. All explicit memory accesses before DMB are observed before any after. Does not flush the pipeline or wait for completion — only ordering.

DSB

Data Synchronization Barrier. All explicit accesses before DSB complete before any instruction after DSB executes. Blocks until stores are drained.

ISB

Instruction Synchronization Barrier. Flushes the pipeline; subsequent instructions are refetched. Needed after changing system state that affects instruction fetch (VTOR, MPU, priv level).

When DMB is enough

Ordering between two software threads or between code and a DMA that shares Normal memory.

msg.payload = 0xDEADBEEF;
__DMB();
msg.ready   = 1;              /* consumer reads ready then payload */

When DSB is needed

Before code that assumes a side-effect has taken hold — e.g. starting a DMA after programming its registers; kicking the watchdog.

DMA1->CHENA |= 1;
__DSB();                      /* ensure reg write hit the peripheral */
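The payload/ready pattern can also be expressed with C11 fences, which a compiler targeting Cortex-M generally implements with DMB. The sketch below shows both sides of the handshake; the type and function names are hypothetical. It covers CPU-to-CPU or thread-to-ISR ordering only: the C memory model says nothing about DMA masters, so the CMSIS `__DMB()` / `__DSB()` intrinsics remain the tool for that case.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t    payload;
    atomic_bool ready;
} message_t;

/* Producer: write the data, then publish the flag. */
static void publish(message_t *m, uint32_t value)
{
    m->payload = value;
    atomic_thread_fence(memory_order_release);   /* plays the role of __DMB() */
    atomic_store_explicit(&m->ready, true, memory_order_relaxed);
}

/* Consumer: check the flag first, then read the data. */
static bool try_consume(message_t *m, uint32_t *out)
{
    if (!atomic_load_explicit(&m->ready, memory_order_relaxed)) {
        return false;
    }
    atomic_thread_fence(memory_order_acquire);   /* order flag read before payload read */
    *out = m->payload;
    return true;
}
```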

Barriers — When & Why

Situation	Barrier	Why
Writing `SCB->VTOR` then enabling IRQs	DSB + ISB	Later vector fetches use the new table
Enabling MPU / updating a region	DSB + ISB	Later fetches see new permissions
Changing CONTROL (switch to PSP / unprivileged)	ISB	Pipeline flush: instructions already in flight used the old mode
Self-modifying code (e.g. flash programming)	DSB + ISB + cache invalidate	Ensure cache coherency and pipeline reload
Configuring DMA registers then enabling	DSB	Make sure writes hit the DMA before enable
Clearing NVIC pending then enabling IRQ	DSB (optional)	Belt-and-braces; usually PPB writes are strongly-ordered anyway
Writing to flash configuration bytes	DSB	Ensure store drained before issuing program command
Clearing FPU CONTROL.FPCA	DSB + ISB	Avoid lazy-stacking ambiguity

MPU — v7-M PMSA

Memory Protection Unit — a region-based permission checker (no translation, no virtualisation).
8 regions on M3/M4 (8 or 16 on M7) plus one "background" region for privileged fall-through.
Each region must be power-of-two size, between 32 B and 4 GiB.
Must be naturally aligned (base address must be a multiple of size).
Higher-numbered region wins when they overlap.

Per-region attributes

Base address (RBAR).
Size (RASR.SIZE = log₂(size) − 1).
Access permission: no / RW-privileged / RW-unpriv / RO.
Type attrs: cacheable, bufferable, shareable, TEX encoding.
Execute-never bit.
Sub-region disable (SRD) — an 8-bit field that carves the region into 8 equal slices; set bit → that slice is disabled.
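The size, alignment and SRD rules are easy to get wrong by hand, so a few helper functions are worth having. The sketch below uses hypothetical names and assumes that SRD bit n controls sub-region n, counted from the lowest address. As a worked case, a 192 KiB SRAM is not a power of two, but it can be covered by a 256 KiB region with the top two 32 KiB slices disabled.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if size is a power of two in [32 B, 4 GiB] and base is a multiple of it. */
static bool mpu_v7_region_valid(uint32_t base, uint64_t size)
{
    if (size < 32u || size > (1ull << 32)) return false;
    if ((size & (size - 1u)) != 0u) return false;
    return (base & (size - 1u)) == 0u;
}

/* RASR.SIZE encoding: log2(size) - 1. Expects a valid power-of-two size. */
static uint32_t mpu_v7_size_field(uint64_t size)
{
    uint32_t log2 = 0u;
    while ((1ull << log2) < size) log2++;
    return log2 - 1u;
}

/* SRD mask that disables the top `disabled` (0..8) of the 8 equal sub-regions. */
static uint32_t mpu_v7_srd_top(unsigned disabled)
{
    return (0xFFu << (8u - disabled)) & 0xFFu;
}
```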

v7-M MPU — Typical Setup

#include "core_cm4.h"

static void mpu_setup(void)
{
    __DMB();
    MPU->CTRL = 0;                 /* disable while editing */

    /* Region 0: entire Code region, RO, executable, normal cacheable */
    MPU->RNR  = 0;
    MPU->RBAR = 0x00000000UL;
    MPU->RASR = (0  << 28) |       /* XN=0, executable */
                (6  << 24) |       /* AP = RO priv+unpriv */
                (0  << 19) |       /* TEX=0 */
                (1  << 18) |       /* S=1  shareable */
                (1  << 17) |       /* C=1  cacheable */
                (1  << 16) |       /* B=1  bufferable */
                (0  << 8)  |       /* SRD = 0 */
                ((29-1) << 1) |    /* SIZE = 2^29 B = 512 MiB */
                (1  << 0);         /* ENABLE */

    /* Region 1: SRAM RW priv, RO unpriv */
    MPU->RNR  = 1;
    MPU->RBAR = 0x20000000UL;
    MPU->RASR = (1  << 28) |       /* XN=1 no exec */
                (2  << 24) |       /* AP priv RW, unpriv RO */
                (1  << 18) | (1<<17) | (1<<16) |
                ((18-1) << 1) |    /* SIZE = 256 KiB */
                1;

    MPU->CTRL = MPU_CTRL_PRIVDEFENA_Msk |    /* default map for priv */
                MPU_CTRL_ENABLE_Msk;
    __DSB(); __ISB();
}

After enabling, any unprivileged write to SRAM raises a MemManage fault (escalated to HardFault unless SHCSR.MEMFAULTENA is set), so the kernel stays in control.
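The hand-shifted RASR values in the setup code can be generated by a small encoder, which keeps the bit positions in one place. This is a sketch with hypothetical type and function names; the field positions follow the shifts used in the listing (XN 28, AP 24, TEX 19, S 18, C 17, B 16, SRD 8, SIZE 1, ENABLE 0).

```c
#include <stdint.h>

typedef struct {
    uint32_t xn;         /* 1 = execute-never */
    uint32_t ap;         /* 3-bit access permission */
    uint32_t tex;        /* 3-bit type extension */
    uint32_t s, c, b;    /* shareable, cacheable, bufferable */
    uint32_t srd;        /* 8-bit sub-region disable mask */
    uint32_t size_log2;  /* region size = 2^size_log2 bytes */
} mpu_v7_attr_t;

/* Build an enabled RASR value from the individual fields. */
static uint32_t mpu_v7_rasr(const mpu_v7_attr_t *a)
{
    return ((a->xn  & 1u) << 28) | ((a->ap & 7u) << 24) |
           ((a->tex & 7u) << 19) | ((a->s  & 1u) << 18) |
           ((a->c   & 1u) << 17) | ((a->b  & 1u) << 16) |
           ((a->srd & 0xFFu) << 8) |
           (((a->size_log2 - 1u) & 0x1Fu) << 1) | 1u;
}
```

For the SRAM region in the listing (XN=1, AP=2, S=C=B=1, 256 KiB) the encoder yields 0x12070023.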

MPU — v8-M PMSAv8

What's different

Base + Limit addressing (not base + power-of-two size). Any 32-byte granularity.
Attribute-indirect: region carries a 3-bit AttrIdx pointing into a shared MPU_MAIR pair of 32-bit registers (à la A-profile MAIR).
No sub-region disable — arbitrary limits make it unnecessary.
Separate banks in TrustZone: MPU_S for Secure, MPU_NS for Non-Secure.
8 or 16 regions configurable at implementation.

/* Attr 0 = Normal, write-back, inner + outer cacheable */
MPU->MAIR0 = (0xFFu << 0);

/* Region 0: flash 0x0800_0000 .. 0x082F_FFFF */
MPU->RNR   = 0;
MPU->RBAR  = 0x08000000u | (0u << 3) /* SH = non-shareable */
                         | (3u << 1) /* AP = priv RO, unpriv RO */
                         | 0u;       /* XN=0, executable */
MPU->RLAR  = 0x082FFFE0u | (0u << 1) /* AttrIndx = 0 -> MAIR0[7:0] */
                         | 1u;       /* ENABLE=1 */

MPU->CTRL = 1u | (1u << 2);        /* ENABLE + PRIVDEFENA */
__DSB(); __ISB();                  /* new settings apply to later accesses */

Cleaner & more orthogonal than v7-M MPU. RTOSes target both via a single abstraction (e.g. FreeRTOS MPU port).
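The base/limit scheme lends itself to small encoder functions. The sketch below uses hypothetical names and assumes the Armv8-M field layout of RBAR (BASE[31:5], SH[4:3], AP[2:1], XN[0]) and RLAR (LIMIT[31:5], AttrIndx[3:1], EN[0]); later architecture revisions add further bits that are ignored here.

```c
#include <stdbool.h>
#include <stdint.h>

/* RBAR: BASE[31:5] | SH[4:3] | AP[2:1] | XN[0] */
static uint32_t mpu_v8_rbar(uint32_t base, uint32_t sh, uint32_t ap, bool xn)
{
    return (base & ~0x1Fu) | ((sh & 3u) << 3) | ((ap & 3u) << 1) | (xn ? 1u : 0u);
}

/* RLAR: LIMIT[31:5] | AttrIndx[3:1] | EN[0]; `last` is the region's last byte. */
static uint32_t mpu_v8_rlar(uint32_t last, uint32_t attr_idx, bool enable)
{
    return (last & ~0x1Fu) | ((attr_idx & 7u) << 1) | (enable ? 1u : 0u);
}

/* Place an 8-bit attribute into slot idx (0..3) of a MAIR register value. */
static uint32_t mair_with_attr(uint32_t mair, unsigned idx, uint8_t attr)
{
    unsigned shift = (idx & 3u) * 8u;
    return (mair & ~(0xFFu << shift)) | ((uint32_t)attr << shift);
}
```

Passing the last byte of the region (0x082F_FFFF for the flash example) rather than a pre-masked limit keeps call sites readable; the function drops the low five bits itself.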

MPU in RTOS Context

FreeRTOS and Zephyr offer MPU-enforced "user-mode" tasks (xTaskCreateRestricted, Zephyr user threads) built on the privileged-kernel / unprivileged-task split.

Caches — M7 / M55 / M85

Cortex-M7

Optional L1 I-cache + D-cache, configurable 4–64 KiB each; the I-cache is 2-way and the D-cache 4-way set-associative.
Write-back or write-through per region (via MPU attrs).
Caches are not coherent with DMA — software managed.
Separate SCB_InvalidateICache, SCB_CleanDCache, SCB_InvalidateDCache operations in CMSIS.

M55 / M85

Optional I-cache and D-cache; the available sizes are an implementation choice, so check the TRM for the specific core.
Added MSCR bits for easier cache maintenance from C.

Why not coherent?

The bus IP choices (AXI, AHB-5) on MCUs don't implement a full snooping protocol. Full MOESI/MESI would cost area and power that an MCU class cannot afford.

Consequence

DMA buffers must be placed in non-cacheable memory (via MPU attrs on the region, or in a bypass region) or the code must clean before sending / invalidate before receiving.

DMA + Cache — The Coherence Recipe

CPU → peripheral (TX)

memcpy(tx_buf, src, n);

/* push dirty lines out before DMA reads */
SCB_CleanDCache_by_Addr((uint32_t*)tx_buf, n);

dma_start(tx_buf, n);

Peripheral → CPU (RX)

/* make sure we don't have stale lines */
SCB_InvalidateDCache_by_Addr((uint32_t*)rx_buf, n);

dma_start(rx_buf, n);
wait_for_dma_done();

/* after DMA writes, invalidate again so CPU reads from RAM */
SCB_InvalidateDCache_by_Addr((uint32_t*)rx_buf, n);
/* buffer now safe to use from CPU */

Alignment: rx_buf must be 32-byte aligned and sized in multiples of 32 B on M7 — Invalidate operates by cache line, and partial lines would clobber adjacent variables. Use __attribute__((aligned(32))) + round-up the length.
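The rounding rule can be captured in a helper that computes which whole cache lines a buffer touches. This is a host-testable sketch with hypothetical names; it assumes the 32-byte line size stated above. If the returned span is larger than the buffer itself, the buffer shares a line with neighbouring data and an invalidate would be unsafe.

```c
#include <stddef.h>
#include <stdint.h>

#define DCACHE_LINE 32u   /* Cortex-M7 D-cache line size in bytes */

typedef struct {
    uintptr_t start;   /* first line-aligned address covered */
    size_t    size;    /* number of bytes, a multiple of the line size */
} cache_span_t;

/* Expand [addr, addr + len) to whole cache lines. */
static cache_span_t cache_span(uintptr_t addr, size_t len)
{
    const uintptr_t mask = ~(uintptr_t)(DCACHE_LINE - 1u);
    uintptr_t start = addr & mask;
    uintptr_t end   = (addr + len + DCACHE_LINE - 1u) & mask;
    cache_span_t s = { start, (size_t)(end - start) };
    return s;
}

/* True if the buffer already starts and ends on line boundaries. */
static int is_line_exact(uintptr_t addr, size_t len)
{
    cache_span_t s = cache_span(addr, len);
    return s.start == addr && s.size == len;
}
```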

Cache Maintenance — Granular vs Full

Operation	Cost	When
`SCB_EnableDCache()`	one-time	Boot. Invalidates whole cache.
`SCB_InvalidateDCache()`	full-cache invalidate (dirty lines are discarded, not written back)	Rare — after a big external memory change, before first DMA.
`SCB_InvalidateDCache_by_Addr(p,n)`	~n/32 cycles	Before every DMA receive into a cacheable buffer.
`SCB_CleanDCache_by_Addr(p,n)`	~n/32 cycles	Before every DMA transmit from a cacheable buffer.
`SCB_CleanInvalidateDCache_by_Addr(p,n)`	~n/32 cycles	Bidirectional DMA (rare).
`SCB_InvalidateICache()`	full I-flush	After self-modifying code or flash-patched code region.

Shortcut for DMA-heavy designs: allocate DMA buffers in a dedicated non-cacheable MPU region (set C=0). The maintenance code goes away — at the cost of slower CPU access to those buffers. Typical trade-off on M7 audio/USB.

Tightly-Coupled Memory (TCM)

What it is

SRAM physically attached to the CPU's instruction and data buses, outside the main bus matrix.
ITCM at low address in Code region (e.g. 0x0000_0000 on STM32H7 when BOOT_ADD0 selects it).
DTCM at low address in SRAM region (0x2000_0000 on STM32H7, 128 KiB).
Single-cycle access, no wait-states, no cache involvement.
Typical size: 16–128 KiB per TCM. Visible only to one master (the core).

What to put there

ISRs that must start in < 20 ns regardless of cache state (audio DSP, motor control PWM update).
Tightest FIR/IIR kernel data.
The exception stack — bounded, predictable.

Trap: DMA cannot access TCM on most M7 SoCs (TCM is on the CPU's private side of the bus matrix). Double-buffer into AXI SRAM, or use the "shared" variant if the vendor provides one.
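A cheap guard against this trap is to check a buffer's address before handing it to a DMA driver. The sketch below uses the STM32H7 DTCM figures quoted above (base 0x2000_0000, 128 KiB) as an assumption; the macro and function names are hypothetical, and the range should be adjusted for any other part.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DTCM_BASE 0x20000000u            /* assumed: STM32H7-style DTCM */
#define DTCM_SIZE (128u * 1024u)

/* True if any byte of [addr, addr + len) lies inside the DTCM window. */
static bool overlaps_dtcm(uint32_t addr, size_t len)
{
    uint64_t end = (uint64_t)addr + len;   /* one past the last byte */
    return len != 0u && addr < DTCM_BASE + DTCM_SIZE && end > DTCM_BASE;
}
```

A driver could assert `!overlaps_dtcm(buf, n)` in debug builds so that a misplaced buffer fails loudly instead of transferring garbage.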

Typical Memory Layout — Cortex-M7 SoC

Endianness Revisited

Data loads/stores obey the CPU's configured endianness.
Instruction fetches are always little-endian regardless (instructions are bytes in ROM written out by the linker).
Byte- and bit-reversal intrinsics map to single-cycle instructions: __REV, __REV16, __REVSH, __RBIT.
Network-byte-order conversion is therefore free compared to 8051-era MCUs where bswap takes 4 instructions.

/* 'be32toh' on little-endian Cortex-M */
static inline uint32_t be32toh(uint32_t v)
{
    uint32_t out;
    __asm volatile ("rev %0, %1" : "=r"(out) : "r"(v));
    return out;
}

/* Compiler already emits REV for __builtin_bswap32 */
uint32_t net = __builtin_bswap32(host);
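When the big-endian value arrives as bytes in a packet buffer, a shift-based load avoids both the alignment question and any dependence on the CPU's endianness. The sketch below uses hypothetical helper names; whether a given compiler collapses it into a word load plus REV depends on the toolchain and optimisation level.

```c
#include <stdint.h>

/* Read a big-endian 32-bit value from an arbitrary byte pointer. */
static inline uint32_t load_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Write a 32-bit value in big-endian byte order. */
static inline void store_be32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}
```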

Stack-Limit Registers (v8-M)

MSPLIM, PSPLIM — 32-bit, 8-byte aligned lower-bound for each stack.
Hardware checks on every push / stack-frame write: if SP < SPLIM → UsageFault with UFSR.STKOF.
Completely replaces the old guard-page hack (unmapped region at the bottom of the stack enforced by MPU).
Banked per security state on TrustZone cores: MSPLIM_S, PSPLIM_S, MSPLIM_NS, PSPLIM_NS.

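/* Illustrative v8-M port hook (hypothetical name, not actual FreeRTOS source) */
void vPortSetupTaskStack(StackType_t *top, StackType_t *bottom)
{
    (void)top;                        /* only the lower bound is needed here */
    __set_PSPLIM((uint32_t)bottom);   /* lowest address the task stack may reach */
}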

/* Stack overflow → UsageFault at first push past limit */
void UsageFault_Handler(void)
{
    if (SCB->CFSR & SCB_CFSR_STKOF_Msk) {
        /* current task smashed its stack — reschedule */
        configASSERT(!"stack overflow");
    }
    for (;;);
}
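The limit check itself is simple enough to model on a host, which helps when sizing stacks or writing unit tests for a port layer. The sketch below uses hypothetical function names: one rounds a stack's lowest address up to the 8-byte alignment the limit registers require, the other mirrors the rule that a push leaving SP below the limit faults.

```c
#include <stdbool.h>
#include <stdint.h>

/* Round the lowest usable stack address up to an 8-byte boundary. */
static uint32_t stack_limit_for(uint32_t stack_bottom)
{
    return (stack_bottom + 7u) & ~7u;
}

/* Model of the hardware rule: fault if the push would leave SP below the limit. */
static bool push_overflows(uint32_t sp, uint32_t bytes, uint32_t limit)
{
    return bytes > sp || (sp - bytes) < limit;
}
```

A push that lands exactly on the limit is still allowed; only going below it triggers the fault.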

Memory Design Patterns

DMA in non-cacheable SRAM

Dedicate a small MPU region or a separate SRAM bank with C=B=0. All DMA descriptors and buffers live there. No cache maintenance ever needed.

Critical code in TCM

__attribute__((section(".itcm"))) for the fast path of the motor-control PWM ISR & the FIR kernel. Deterministic 1-cycle fetch; no I-cache jitter.

Split .rodata for XIP

Large LUTs (sine tables, glyphs) in QSPI XIP region (0x9000_0000). Hot data in internal flash / AXI-SRAM. Linker script manages placement.

Kernel-only region

A dedicated MPU region covers the SRAM top-slice as RW priv, no unpriv; give it a higher number than any overlapping user region so that it wins. RTOS puts control blocks there; user tasks can never corrupt them.

Scratchpad ping-pong

Two buffers aligned to 32 B. DMA-receive into A while CPU processes B; on done, swap and invalidate D-cache for the newly-filled one.
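The bookkeeping for the ping-pong pattern fits in a few lines. The sketch below is host-testable and uses hypothetical names and an illustrative buffer length; on target, the swap would be followed by `SCB_InvalidateDCache_by_Addr` on the returned buffer and by re-arming the DMA on the other one.

```c
#include <stddef.h>
#include <stdint.h>

#define PP_LEN 64u   /* bytes per buffer; a multiple of the 32 B line (illustrative) */

typedef struct {
    _Alignas(32) uint8_t buf[2][PP_LEN];
    unsigned dma_idx;              /* index of the buffer currently owned by the DMA */
} pingpong_t;

/* Call on DMA completion: returns the filled buffer and gives the other to the DMA. */
static uint8_t *pingpong_swap(pingpong_t *pp)
{
    uint8_t *filled = pp->buf[pp->dma_idx];
    pp->dma_idx ^= 1u;
    /* On target: invalidate the D-cache for `filled` before the CPU reads it. */
    return filled;
}
```

Keeping both buffers in one aligned struct guarantees neither shares a cache line with the other or with unrelated variables.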

ECC-aware .bss

Zero-init every word at boot to avoid ECC "uninitialised" traps on the first read — even on variables nominally uninitialised.

