ARM CORTEX-M · PRESENTATION 02

Architecture & Programmer's Model

The Cortex-M Family · Armv6-M / v7-M / v8-M · Thumb-2 · Registers & Modes

M0 · M0+ · M3 · M4 · M7 · M23 · M33 · M35P · M52 · M55 · M85

Why Cortex-M Matters

The default 32-bit MCU ISA. Virtually every major MCU vendor (ST, NXP, Nordic, Silicon Labs, Renesas, Microchip, Infineon, TI, Ambiq, GigaDevice, Raspberry Pi) ships Cortex-M silicon.
Volume. Arm-based microcontrollers ship at tens of billions of units per year; Cortex-M is the largest slice.
Span. One family from a $0.10 Cortex-M0 to a 600 MHz cache-equipped Cortex-M85 — same toolchain, same CMSIS headers.
Ubiquity. Also lives inside bigger chips: as a secure enclave (Cortex-M33), a wake-up controller, or a peripheral offload engine in many A-profile SoCs.

Interview context

Cortex-M is the one architecture an embedded-firmware engineer is almost guaranteed to be tested on. Expect questions on the exception model, memory map, NVIC, MPU, and increasingly TrustZone.

"The M-profile is the only 32-bit architecture designed from the ground up for deterministic interrupt latency and bit-level peripheral control — everything else is a general-purpose CPU with interrupts bolted on."
— Paraphrasing Joseph Yiu's design philosophy

Arm Profiles: A, R, M

Profile	Target	Memory	OS	Examples
A Application	Smartphones, servers, laptops	MMU with virtual memory (VMSA page tables)	Linux, Android, Windows, iOS	Cortex-A78, Neoverse N2, Apple M-series
R Real-Time	Automotive, baseband, storage	MPU (protection, no virt)	AUTOSAR, QNX, bare-metal	Cortex-R52, R82
M Microcontroller	Deeply embedded, IoT, sensors	Optional MPU · physical addresses only	FreeRTOS, Zephyr, bare-metal	Cortex-M0+, M4, M33, M85

What makes M-profile different

Thumb-only — no A32/A64 mode; the core starts in Thumb state and stays there.
NVIC is architectural, not an SoC add-on. Every Cortex-M has one; every C compiler knows about it.
Exception entry in hardware: automatic stacking of the caller-saved register set.
No virtual memory. Addresses go directly to the bus.

What M-profile does not have

No A64 (AArch64) instructions — 32-bit only.
No virtualisation extensions (EL2).
No SMP coherency protocol. Multi-core Cortex-M is asymmetric.
Caches are optional and appear only on the higher-end cores (M7, M52, M55, M85, plus an optional instruction cache on M35P).

The Cortex-M Family Timeline

Dates indicate initial announcement. v8.1-M (Helium/MVE) layered on top of v8-M Mainline adds the vector extension used by M52 · M55 · M85.

Choose a Core

Same programmer's model, vastly different trade-offs. The interactive selector covers Cortex-M0 · M0+ · M3 · M4 · M7 · M23 · M33 · M35P · M52 · M55 · M85; the Cortex-M85 card is reproduced here.

Cortex-M85

Armv8.1-M Mainline · Helium (MVE) · 7-stage superscalar · PACBTI · Cache · TrustZone

Arm's current flagship M-profile — ~6.3 CoreMark/MHz, ~4 DMIPS/MHz.
Dual-issue in-order pipeline with branch prediction.
Full Helium (MVE) vector unit — int8/int16/int32/FP16/FP32.
Optional PACBTI (pointer auth + branch-target identification) for CFI.
Target: ML-on-MCU, high-end real-time control, audio DSP.

Cortex-M Lineup at a Glance

Core	Arch	Pipeline	DMIPS/MHz	DSP	FPU	Helium	TrustZone	Cache
M0	v6-M	3-stage	0.9	—	—	—	—	—
M0+	v6-M	2-stage	0.95	—	—	—	—	—
M3	v7-M	3-stage	1.25	—	—	—	—	—
M4	v7E-M	3-stage	1.25	✓	FPv4-SP (opt)	—	—	—
M7	v7E-M	6-stage dual-issue	2.14	✓	FPv5 SP/DP (opt)	—	—	L1 I+D
M23	v8-M Base	2-stage	0.99	—	—	—	✓	—
M33	v8-M Main	3-stage	1.5	opt	FPv5-SP (opt)	—	✓	—
M35P	v8-M Main	3-stage	1.5	opt	FPv5-SP (opt)	—	✓	opt I
M52	v8.1-M	4-stage	1.6	✓	FPv5 SP (opt)	✓ (int+FP)	✓	opt
M55	v8.1-M	4-stage	1.7	✓	FPv5 SP (opt)	✓ (int+FP)	✓	opt I+D
M85	v8.1-M	7-stage dual-issue	4.0	✓	FPv5 SP/DP (opt)	✓ (int+FP)	✓	opt I+D

Headline numbers are upper-bound estimates from Arm's own benchmarking. Real silicon depends on memory latency, wait states, and which options the SoC vendor licensed.

Armv6-M — The Minimal M-Profile

Instruction set

Thumb subset: ~56 instructions.
No integer divide — software emulation.
No bit-field ops, no saturating arithmetic.
32×32→32 multiply (1-cycle or 32-cycle by option).
No IT instruction: conditional execution is by branches only. Almost everything is a 16-bit encoding; only BL, MRS/MSR and DSB/DMB/ISB are 32-bit.
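
With no UDIV/SDIV, every `/` and `%` on Armv6-M becomes a call into the toolchain's run-time library (the Arm EABI names the unsigned helper `__aeabi_uidiv`). The sketch below shows the general shift-and-subtract technique in C. It is an illustration, not the actual library routine; `soft_udiv32` and its divide-by-zero policy are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: unsigned 32-bit restoring division, one quotient
 * bit per loop iteration. Division by zero returns 0 and leaves the
 * numerator as the remainder; that policy is a placeholder. */
uint32_t soft_udiv32(uint32_t num, uint32_t den, uint32_t *rem_out)
{
    uint32_t quot = 0, rem = 0;

    if (den == 0) {
        if (rem_out != NULL) *rem_out = num;
        return 0;
    }
    for (int bit = 31; bit >= 0; --bit) {
        uint32_t carry = rem >> 31;              /* bit shifted out of rem */
        rem = (rem << 1) | ((num >> bit) & 1u);  /* bring down next bit    */
        if (carry || rem >= den) {               /* partial remainder fits */
            rem -= den;                          /* modulo-2^32 is correct */
            quot |= 1u << bit;
        }
    }
    if (rem_out != NULL) *rem_out = rem;
    return quot;
}
```

The cost is one loop iteration per quotient bit, which is why v6-M code tends to keep division out of ISRs and hot loops.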

System features

No bit-banding.
No BASEPRI — only PRIMASK for IRQ masking.
Max 32 external IRQs, 4 priority levels.
MPU optional on M0+ (8 regions) — not on M0/M1.
Single vector table; no VTOR on M0 (relocation via remap-to-SRAM trick on some SoCs); VTOR on M0+.

Why it still matters: v6-M is cheap enough to drop in as a wake-up/boot controller on top of something far bigger. It is also the ISA targeted by dozens of Arm-compatible open clones.

Armv7-M / v7E-M — Full Thumb-2

v7-M (M3)

Full Thumb-2 (~200 instructions).
Hardware divide (SDIV/UDIV, 2–12 cycles).
Bit-field: BFI, BFC, UBFX, SBFX.
Exclusive access: LDREX/STREX (word, half, byte).
Bit-banding, NVIC, full MPU option.
Up to 240 IRQs, 8-bit priority field (implementations expose 3–8 bits).
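
Bit-banding maps each bit in the first 1 MB of the SRAM and peripheral regions to its own word in an alias region, so a single word store sets or clears one bit atomically. A minimal sketch of the address arithmetic, assuming the standard v7-M map (bit-band regions at 0x20000000 and 0x40000000, aliases at 0x22000000 and 0x42000000). `bitband_alias` is a hypothetical helper, and the feature is an implementation option, so check the device manual before relying on it.

```c
#include <stdint.h>

#define SRAM_BB_BASE     0x20000000u
#define SRAM_BB_ALIAS    0x22000000u
#define PERIPH_BB_BASE   0x40000000u
#define PERIPH_BB_ALIAS  0x42000000u
#define BB_REGION_SIZE   0x00100000u   /* 1 MB per bit-band region */

/* Returns the alias word address for bit `bit` (0..7) of the byte at
 * `addr`, or 0 if the byte lies outside both bit-band regions. */
uint32_t bitband_alias(uint32_t addr, unsigned bit)
{
    if (bit > 7u)
        return 0;
    if (addr - SRAM_BB_BASE < BB_REGION_SIZE)
        return SRAM_BB_ALIAS + (addr - SRAM_BB_BASE) * 32u + bit * 4u;
    if (addr - PERIPH_BB_BASE < BB_REGION_SIZE)
        return PERIPH_BB_ALIAS + (addr - PERIPH_BB_BASE) * 32u + bit * 4u;
    return 0;
}
```

For example, bit 3 of the byte at 0x20000004 lands at alias word 0x2200008C; writing 1 or 0 to that word sets or clears just that bit.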

v7E-M adds (M4, M7)

DSP extension: QADD/QSUB, QADD8/QADD16, SSAT16/USAT16, SMLAD, SMLALD, SMMUL etc. (SSAT/USAT and SMLAL are already part of base v7-M.)
Packed-operand SIMD on 32-bit registers (2×16-bit or 4×8-bit lanes).
Optional FPU: single-precision FPv4-SP on M4; FPv5 in single- or single+double-precision form on M7.
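
The saturating and packed operations are easiest to understand from a portable C model. The functions below mimic what QADD and QADD16 compute. The names `qadd_model` and `qadd16_model` are hypothetical; on real hardware the CMSIS intrinsics `__QADD` and `__QADD16` map to a single instruction each.

```c
#include <stdint.h>

/* Model of QADD: signed 32-bit add that clamps instead of wrapping. */
int32_t qadd_model(int32_t a, int32_t b)
{
    int64_t sum = (int64_t)a + (int64_t)b;
    if (sum > INT32_MAX) return INT32_MAX;
    if (sum < INT32_MIN) return INT32_MIN;
    return (int32_t)sum;
}

/* Sign-extend the low 16 bits of v without implementation-defined casts. */
static int32_t sext16(uint32_t v)
{
    return (int32_t)((v & 0xFFFFu) ^ 0x8000u) - 0x8000;
}

/* Model of QADD16: two independent saturating 16-bit lanes in one word. */
uint32_t qadd16_model(uint32_t x, uint32_t y)
{
    uint32_t out = 0;
    for (unsigned lane = 0; lane < 2u; ++lane) {
        int32_t s = sext16(x >> (16u * lane)) + sext16(y >> (16u * lane));
        if (s > INT16_MAX) s = INT16_MAX;
        if (s < INT16_MIN) s = INT16_MIN;
        out |= (uint32_t)(uint16_t)s << (16u * lane);
    }
    return out;
}
```

The point of the hardware versions is that the clamp costs nothing extra: audio and control loops get overflow-safe arithmetic without compare-and-branch sequences.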

The DSP extension alone is a major interview topic for any DSP-adjacent role — see presentation 04.

Armv8-M — TrustZone Arrives

v8-M Baseline (M23)

Armv6-M + TrustZone (the "hardware security" part).
Adds SG, BXNS, BLXNS, MOVW/MOVT, CBZ/CBNZ, hardware divide, and exclusive access (LDREX/STREX).
Keeps a small area footprint comparable to the M0+.

v8-M Mainline (M33, M35P)

Superset of v7-M + TrustZone.
Introduces stack-limit registers (MSPLIM, PSPLIM) — HW-enforced stack overflow detection.
Co-processor interface (ACI) for vendor-specific instructions.
Separate S/NS banks for SP, PSPLIM, MSPLIM, CONTROL, FAULTMASK, BASEPRI, PRIMASK.

Armv8.1-M (M52/M55/M85) — layered on top of v8-M Mainline. Adds Helium (MVE), low-overhead loops (LO Branch extension), custom-instruction framework, optional PACBTI. See presentation 04.

Thumb-2 — Mixed 16/32-bit Encoding

The idea

Cortex-M runs only in Thumb state — no A32 (ARM state), no A64.
Most common instructions encode in 16 bits → excellent code density.
Less-common or wider-immediate variants use a 32-bit encoding, distinguishable by the top-5 bits of the half-word.
Result: roughly A32-class performance with code about 1.3× smaller (around a quarter less flash).

Code density matters more than clock speed when your flash is 64 KB and costs $0.02/KB.

Encoding at a glance

; 16-bit encodings (common case)
  MOVS  r0, #42        ; 0x202A
  ADDS  r0, r0, r1     ; 0x1840
  LDR   r0, [r1, #4]   ; 0x6848

; 32-bit encodings (wider imm, etc.)
  MOVW  r0, #0x1234    ; F241 2034
  BL    some_func      ; F7FF FFFE (unrelocated placeholder in the object file)
  UDIV  r0, r1, r2     ; FBB1 F0F2
  SMLAL r0,r1,r2,r3    ; FBC2 0103

Decoder inspects the first half-word: bits[15:11] = 11101, 11110, or 11111 → 32-bit instruction.
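
The length rule can be sketched in a few lines of C. `thumb_is_32bit` is a hypothetical helper name, and the halfword values in the usage note are illustrative samples.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if `first_halfword` starts a 32-bit Thumb-2 instruction, i.e. its
 * top five bits are 0b11101, 0b11110 or 0b11111. Anything else is a
 * complete 16-bit instruction. */
bool thumb_is_32bit(uint16_t first_halfword)
{
    unsigned top5 = (unsigned)first_halfword >> 11;
    return top5 == 0x1Du || top5 == 0x1Eu || top5 == 0x1Fu;
}
```

For instance 0x6848 (a 16-bit LDR) returns false, 0xFBB1 (first half of UDIV) returns true, and 0xE7FE (a 16-bit unconditional branch, top bits 0b11100) returns false. A disassembler only ever needs this one check to know how far to advance.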

The IT (If-Then) Block

A32 carries a condition code on nearly every instruction; Thumb-2 instead uses a lightweight IT instruction to predicate the next 1–4 instructions.

; if (r0 < r1) r2 = r0; else r2 = r1;
  CMP   r0, r1
  ITE   LT          ; If Then Else
  MOVLT r2, r0      ; IF branch
  MOVGE r2, r1      ; ELSE branch

Encodes "T" and "E" for up to 4 following instructions.
Avoids short branches; predictable timing → good for ISRs.

Interview gotchas

An IT block is architectural — the CPU tracks ITSTATE in EPSR. You cannot branch into the middle of one.
An exception taken inside an IT block is fine — hardware saves xPSR (which includes ITSTATE) on stack and restores it on return.
Armv8-A deprecates multi-instruction IT blocks in T32, but that restriction does not apply to M-profile. v8.1-M adds conditional-select instructions (CSEL, CSINC, CSINV, CSNEG) as a branch-free alternative for simple cases.

Core Registers — R0 to R15

R0 – R3

Argument / scratch. Caller-saved. Auto-stacked on exception.

R4 – R11

Variable registers. Callee-saved.

R12 (IP)

Intra-procedure scratch. Used by linker veneers. Auto-stacked.

R13 (SP)

Stack pointer — banked MSP / PSP (and on v8-M, S/NS variants).

R14 (LR)

Link register. On exception: holds EXC_RETURN magic value.

R15 (PC)

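Program counter. Instruction addresses are always halfword-aligned; bit 0 of a branch target (BX, BLX, POP {PC}, vector-table entries) must be 1 to select Thumb state and is never stored in the PC itself.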

AAPCS calling convention

Args in R0–R3, then stack.
Return value in R0 (or R0:R1 for 64-bit).
R4–R11, LR preserved across calls.
Stack 8-byte aligned at public interfaces.

The auto-stacked set {R0-R3, R12, LR, PC, xPSR} = exactly the caller-saved set + return state. Hardware is effectively doing the function-prologue save for the ISR.

xPSR — The Program Status Register

Three logical views of one physical register: APSR (flags), EPSR (execution state), IPSR (active exception number). MRS can name each view (EPSR reads as zero), while MSR can only write the APSR flags. On exception entry, the whole xPSR is pushed as the 8th auto-stacked word.
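
A small decoder makes the three views concrete. This is a sketch assuming the Armv7-M bit layout (flags in [31:27], T in bit 24, IT/ICI state split across [26:25] and [15:10], exception number in [8:0]); the struct and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     n, z, c, v, q;   /* APSR condition and saturation flags   */
    bool     thumb;           /* EPSR.T (bit 24), always 1 when running */
    uint8_t  itstate;         /* EPSR IT[7:0]: [15:10] -> 7:2, [26:25] -> 1:0 */
    uint16_t exception;       /* IPSR: 0 = Thread mode, else exception # */
} xpsr_fields;

xpsr_fields decode_xpsr(uint32_t xpsr)
{
    xpsr_fields f;
    f.n = (xpsr >> 31) & 1u;
    f.z = (xpsr >> 30) & 1u;
    f.c = (xpsr >> 29) & 1u;
    f.v = (xpsr >> 28) & 1u;
    f.q = (xpsr >> 27) & 1u;
    f.thumb = (xpsr >> 24) & 1u;
    f.itstate = (uint8_t)(((xpsr >> 25) & 0x3u) | (((xpsr >> 10) & 0x3Fu) << 2));
    f.exception = (uint16_t)(xpsr & 0x1FFu);
    return f;
}
```

Feeding it a stacked value of 0x61000003 gives Z=1, C=1, T=1 and exception number 3 (HardFault), which is the kind of reading a fault handler does when dumping the frame.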

Special Registers

Reg	Width	Purpose	Access
PRIMASK	1 bit	Master IRQ disable. `cpsid i` sets; `cpsie i` clears. Blocks all exceptions except NMI and HardFault.	Privileged
FAULTMASK	1 bit	Like PRIMASK but also blocks HardFault. Auto-cleared on return from handler.	Privileged · not on v6-M
BASEPRI	8 bit	Block all IRQs of numerical priority ≥ BASEPRI (lower number = higher priority). Finer-grained than PRIMASK.	Privileged · not on v6-M
CONTROL	3 bit	Bit 0 (nPRIV): Thread mode privilege; 1 = unprivileged. Bit 1 (SPSEL): stack in use; 0 = MSP, 1 = PSP (in Thread mode). Bit 2 (FPCA): FPU context active — controls lazy stacking.	Privileged (writing)
MSPLIM / PSPLIM	32 bit	Stack-pointer lower limit. Hardware UsageFault on stack descent past limit.	Privileged · v8-M Mainline only

Interview tell: a candidate who instinctively reaches for BASEPRI (not PRIMASK) to build a critical section shows they understand priority-preemption: a FreeRTOS kernel lock only needs to mask interrupts at or below the kernel's priority ceiling, so a high-priority motor-control ISR keeps running. (NMI is blocked by neither register.)
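
The masking rule is simple enough to model on a host. The sketch assumes 4 implemented priority bits (device headers expose the real number as `__NVIC_PRIO_BITS`) and covers configurable-priority interrupts only; `prio_to_reg` and `irq_blocked` are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

#define PRIO_BITS 4u   /* assumption: 16 priority levels implemented */

/* Priority levels occupy the top PRIO_BITS of the 8-bit field. */
uint8_t prio_to_reg(uint8_t level)
{
    return (uint8_t)(level << (8u - PRIO_BITS));
}

/* Would an interrupt with register priority `irq_prio` be held off?
 * BASEPRI == 0 means "no BASEPRI masking"; otherwise anything with a
 * numerically equal or larger (less urgent) priority is blocked. */
bool irq_blocked(uint8_t irq_prio, uint8_t basepri, bool primask)
{
    if (primask)
        return true;
    if (basepri == 0u)
        return false;
    return irq_prio >= basepri;
}
```

With BASEPRI set to level 5, a level-6 UART interrupt is deferred while a level-2 motor-control interrupt still preempts, which is exactly the behaviour a BASEPRI-based critical section is after.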

Operating State Matrix

Mode	Privileged	Unprivileged
Thread mode	Privileged Thread: boot state · kernel code · RTOS idle task · full SCS access	Unprivileged Thread: user tasks · no SCB/NVIC writes · MPU enforces memory isolation
Handler mode	Handler (always Privileged): every exception / interrupt runs here · always on MSP · IPSR exception number ≠ 0	(does not exist)

Transitions — to Handler

Any exception (IRQ, fault, SVC, PendSV, SysTick).
Hardware auto-stacks 8 registers, switches SP to MSP, IPSR ← exception #.

Transitions — back to Thread

Handler returns by loading EXC_RETURN value into PC (via BX LR, POP {PC}, etc.).
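Magic bits in EXC_RETURN choose the return mode, MSP vs PSP, and the frame type. CONTROL.nPRIV is not touched by exception entry or return: Handler mode simply ignores it, so Thread mode resumes at whatever privilege level CONTROL currently holds.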

Dual Stacks — MSP & PSP

Why two stacks?

MPU can restrict user-task stacks to their own PSP region without blocking the kernel's MSP.
A rogue task that blows its stack cannot corrupt kernel state: at worst it takes a MemManage fault from the MPU, or a UsageFault when it crosses PSPLIM on v8-M Mainline.
Exception always stacks on the currently-active SP, then switches to MSP for the handler. On return, CPU restores the previous SP selection from EXC_RETURN.

; typical RTOS switch to Unpriv Thread w/ PSP
  LDR   r0, =task_stack_top
  MSR   PSP, r0
  MOVS  r0, #0b011       ; SPSEL=1, nPRIV=1
  MSR   CONTROL, r0
  ISB
  BX    lr               ; return to task

Exception Stack Frame

Hardware pushes exactly the caller-saved set. The handler may compile as an ordinary C function with no prologue tricks.
Stack is aligned to 8 bytes (AAPCS) — CPU inserts padding and flags it in bit 9 of the stacked xPSR.
If FPU is enabled and context has been touched (CONTROL.FPCA=1), an extended frame of 26 words is used.
Lazy stacking (v7E-M, v8-M): space is reserved but S0-S15/FPSCR are not actually written until the handler itself executes an FP instruction. Typical 17-cycle saving.
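
A fault handler or RTOS port usually overlays a struct on the stacked frame. The sketch below assumes the v7-M layout described above; `basic_frame` and `pre_exception_sp` are hypothetical names, and the 26-word extended frame size is taken from the text.

```c
#include <stdint.h>

/* Basic exception frame, lowest address (new SP) first. */
typedef struct {
    uint32_t r0, r1, r2, r3;
    uint32_t r12;
    uint32_t lr;     /* LR of the interrupted code        */
    uint32_t pc;     /* return address                    */
    uint32_t xpsr;   /* bit 9 set if padding was inserted */
} basic_frame;

#define EXC_RETURN_BASIC_FRAME (1u << 4)  /* 1 = basic, 0 = extended (FP) */
#define XPSR_ALIGN_PAD         (1u << 9)

/* Recover the SP value the interrupted code had before stacking. */
uint32_t pre_exception_sp(uint32_t frame_addr, uint32_t exc_return,
                          uint32_t stacked_xpsr)
{
    uint32_t words = (exc_return & EXC_RETURN_BASIC_FRAME) ? 8u : 26u;
    uint32_t sp = frame_addr + words * 4u;
    if (stacked_xpsr & XPSR_ALIGN_PAD)
        sp += 4u;                          /* skip the alignment pad word */
    return sp;
}
```

A HardFault handler typically picks MSP or PSP from EXC_RETURN, casts it to `basic_frame *`, and reports `pc` and `lr`; the helper above is what a stack unwinder needs to continue past the frame.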

EXC_RETURN — The Magic LR

On exception entry, LR is loaded with a value whose bits [31:5] are all ones. Bit [4] selects the frame type (1 = basic, 0 = extended FP) and bits [3:0] encode the return mode and stack:

; v7-M EXC_RETURN values

  0xFFFFFFF1   Handler → Handler   MSP   Basic
  0xFFFFFFF9   Handler → Thread    MSP   Basic
  0xFFFFFFFD   Handler → Thread    PSP   Basic

  0xFFFFFFE1   Handler → Handler   MSP   Extended (FP)
  0xFFFFFFE9   Handler → Thread    MSP   Extended
  0xFFFFFFED   Handler → Thread    PSP   Extended

Executing BX lr with any of these forces an exception-return sequence: unstack the frame, restore xPSR / IPSR, resume.
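
The table can be turned into a small decoder. This sketch handles only the six v7-M values listed above and treats everything else as invalid; `exc_return_info` and `decode_exc_return` are hypothetical names, and the extra v8-M bits are not modelled.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool valid;      /* one of the six architected v7-M values  */
    bool to_thread;  /* false = return to Handler mode          */
    bool use_psp;    /* frame is unstacked from PSP, else MSP   */
    bool fp_frame;   /* extended (FP) frame was stacked         */
} exc_return_info;

exc_return_info decode_exc_return(uint32_t lr)
{
    exc_return_info info = { false, false, false, false };

    if ((lr & 0xFFFFFFE0u) != 0xFFFFFFE0u)
        return info;                       /* not an EXC_RETURN pattern */

    switch (lr & 0xFu) {
    case 0x1u: info.to_thread = false; info.use_psp = false; break;
    case 0x9u: info.to_thread = true;  info.use_psp = false; break;
    case 0xDu: info.to_thread = true;  info.use_psp = true;  break;
    default:   return info;                /* reserved combination */
    }
    info.fp_frame = (lr & (1u << 4)) == 0u;
    info.valid = true;
    return info;
}
```

An RTOS context switcher makes the same decision in assembly: it tests bit 2 to pick the stack and bit 4 to decide whether S16-S31 also need saving.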

v8-M extensions

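Bit [6] (S) indicates whether the frame sits on the Secure or Non-secure stack.
Bit [0] (ES) records the security domain the exception was taken to; bit [5] (DCRS) indicates whether the additional callee-saved context and integrity signature were stacked during a cross-domain exception.
An invalid EXC_RETURN used in Handler mode → UsageFault (INVPC). In Thread mode the value is not an exception return at all: it is an ordinary branch into execute-never address space and faults on the fetch.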

Common bug: a handler written in pure assembler clobbers LR and forgets to reload EXC_RETURN before BX lr. Result: random jump, HardFault, or worse.

Pipeline Comparison

Core	Stages	Pipeline	Branch prediction	Issue width
M0 / M3 / M4	3	Fetch — Decode — Execute	None on M0; branch speculation (target prefetched at decode) on M3/M4	1 (in-order)
M0+	2	Fetch/Decode — Execute	None (single-cycle branch target)	1
M23	2	Fetch — Decode/Execute	Static	1
M33 / M52 / M55	3 – 4	F — D — E (— WB on M55)	Static	1
M7	6	F1 F2 D1 D2 EX WB	Dynamic (BHT, BTB)	Dual-issue (in-order)
M85	7	F1 F2 D1 D2 I EX WB	Dynamic (BHT, BTB, RAS)	Dual-issue (in-order)

Even the deepest Cortex-M pipelines are in-order. There is no register renaming, no out-of-order execution, no speculation past branches without rewinding. This is a deliberate design choice: deterministic WCET matters more than peak IPC for the target workloads.

Endianness & Alignment

Endianness

Cortex-M supports either little- or big-endian — chosen at reset by a strap pin (BIGEND).
Virtually every shipping silicon implementation is little-endian. Big-endian exists in the architecture but you are unlikely to see it in the wild.
Data endianness only: instruction fetches and PPB accesses are always little-endian.
REV, REV16, REVSH, RBIT instructions for byte/bit swaps.

Alignment

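v7-M / v7E-M / v8-M Mainline: unaligned word & half-word loads/stores supported in Normal memory (split into aligned beats, several cycles). Multi-register and exclusive accesses (LDM/STM, LDRD/STRD, LDREX/STREX) must still be aligned.
v6-M and v8-M Baseline: unaligned access always faults (HardFault). M3/M4 can opt into the same strictness: CCR.UNALIGN_TRP=1 raises a UsageFault.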
Device & Strongly-Ordered memory: unaligned always faults.
Stack must be 8-byte aligned at exception entry; hardware enforces.
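
Both topics come up when parsing wire formats. The sketch shows portable C equivalents of what REV and REV16 compute, plus a byte-wise little-endian load that is safe at any alignment and on any core. The function names are hypothetical. Compilers commonly recognise the swap pattern and emit a single REV, though that is an optimisation, not a guarantee.

```c
#include <stdint.h>

/* What REV computes: reverse the four bytes of a word. */
uint32_t rev32(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000FF00u) |
           ((x << 8) & 0x00FF0000u) | (x << 24);
}

/* What REV16 computes: swap the bytes inside each halfword. */
uint32_t rev16(uint32_t x)
{
    return ((x & 0x00FF00FFu) << 8) | ((x >> 8) & 0x00FF00FFu);
}

/* Alignment-agnostic little-endian load: four byte reads, so it never
 * faults on v6-M or on Device memory restrictions for byte access. */
uint32_t load_le32(const void *p)
{
    const uint8_t *b = (const uint8_t *)p;
    return (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
           ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}
```

Reading a big-endian network field is then `rev32(load_le32(ptr))` on a little-endian core, with no dependence on the pointer's alignment.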

Semaphore Primitives — LDREX / STREX

; atomic increment of *p
try:
    LDREX  r1, [r0]      ; tag exclusive
    ADDS   r1, r1, #1
    STREX  r2, r1, [r0]  ; r2=0 ok, 1 fail
    CMP    r2, #0
    BNE    try
    DMB

Single-core architectural primitive — maps to a local monitor in each Cortex-M.
Any exception, context switch, or write to the tagged address clears the monitor → retry.
Used by FreeRTOS (on M3+) and C11 atomics.
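Byte and halfword variants (LDREXB/STREXB, LDREXH/STREXH) exist from v7-M onward; v8-M adds load-acquire / store-release forms (LDA/STL, LDAEX/STLEX) with clearer ordering rules.
M0 / M0+ (v6-M) lack LDREX/STREX and use cpsid i / PRIMASK critical sections instead. M23 (v8-M Baseline) does have them.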
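
In C the retry loop is normally written with C11 atomics and left to the compiler. A minimal sketch using a compare-exchange loop, which mirrors the load / modify / conditional-store retry structure; `atomic_increment` is a hypothetical name. On cores with exclusive-access instructions a compiler will typically lower this to an LDREX/STREX loop, while on cores without them it usually becomes a library call, so treat the mapping as toolchain-dependent.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Atomically increment *p and return the new value. */
uint32_t atomic_increment(_Atomic uint32_t *p)
{
    uint32_t old = atomic_load_explicit(p, memory_order_relaxed);

    /* On failure `old` is refreshed with the current value, so the
     * next attempt recomputes old + 1 from fresh data. */
    while (!atomic_compare_exchange_weak_explicit(
               p, &old, old + 1u,
               memory_order_seq_cst, memory_order_relaxed)) {
        /* retry */
    }
    return old + 1u;
}
```

The weak form is allowed to fail spuriously, which matches STREX semantics: an interrupt between the load and the store clears the monitor and simply costs one more trip round the loop.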

Bus Interfaces

Core	Bus	Notes
M0 / M0+ / M1	AHB-Lite	Single 32-bit master bus; separate PPB for CoreSight/NVIC
M3 / M4	AHB-Lite (I-Code / D-Code / System bus + PPB)	Code-space traffic on I-Code & D-Code; everything else on System bus → SoC can arbitrate independently
M7	AXI-M + AHB peripheral port (AHBP) + ITCM/DTCM	AXI for high-bandwidth external memory; TCM for deterministic code/data
M23 / M33 / M35P	AHB-5 (TrustZone-aware)	Security attributes (HNONSEC) on every transaction
M52 / M55 / M85	AHB-5 + optional AXI + TCM + Co-processor (ACI) bus	ACI exposes dedicated instruction opcodes to an attached accelerator

Harvard-style split (I-bus / D-bus) lets the CPU fetch instructions and operate on data in parallel — critical for hitting the 1 DMIPS/MHz target on a 3-stage pipeline. The bus matrix in the SoC ultimately collapses them onto a unified memory, but the core sees separate paths.

Reset Behaviour

What the CPU does out of reset

Read 32-bit word at address 0x00000000 → load into MSP.
Read 32-bit word at address 0x00000004 → load into PC (reset vector).
Enter Privileged Thread mode, MSP selected, T-bit = 1.
All NVIC IRQs disabled, VTOR = 0, SCB/MPU unconfigured.
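
The first two words of an image are easy to sanity-check offline. The sketch parses a little-endian binary and checks that the reset vector has bit 0 set (Thumb) and that the initial MSP is word-aligned; `reset_info` and `parse_vector_table` are hypothetical names, and the sample addresses in the usage note are made up.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t initial_msp;  /* word 0 of the vector table            */
    uint32_t reset_pc;     /* word 1 with the Thumb bit stripped    */
    bool     ok;           /* basic plausibility checks passed      */
} reset_info;

static uint32_t le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

reset_info parse_vector_table(const uint8_t *image, size_t len)
{
    reset_info r = { 0u, 0u, false };
    if (image == NULL || len < 8u)
        return r;

    uint32_t vector = le32(image + 4);
    r.initial_msp = le32(image);
    r.reset_pc = vector & ~1u;
    /* Bit 0 of the vector must be 1: loading an even address into PC
     * would clear the T-bit and fault on the first instruction. */
    r.ok = (vector & 1u) != 0u && (r.initial_msp & 3u) == 0u;
    return r;
}
```

An image starting with the bytes 00 20 00 20 C1 00 00 08 decodes to an initial MSP of 0x20002000 and a reset handler at 0x080000C0. A missing Thumb bit in word 1 is a classic cause of an immediate HardFault after flashing a hand-built vector table.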

Typical startup code

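Reset_Handler:             ; v7-M or later (uses IT and post-indexed addressing)
  LDR    r0, =_sidata    ; .data load address in flash
  LDR    r1, =_sdata     ; .data start in SRAM
  LDR    r2, =_edata     ; .data end in SRAM
copy: CMP r1, r2
  ITT    LO              ; unsigned compare: these are addresses
  LDRLO  r3, [r0], #4
  STRLO  r3, [r1], #4
  BLO    copy

  LDR    r0, =_sbss      ; .bss zero
  LDR    r1, =_ebss
  MOVS   r2, #0
zero: CMP r0, r1
  ITT    LO
  STRLO  r2, [r0], #4
  BLO    zero

  BL     SystemInit      ; PLLs, caches, VTOR
  BL     __libc_init_array
  BL     main
  B      .               ; trap here if main ever returns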

Vendor Variation You Will See

ST — STM32

Largest Cortex-M portfolio. F0 (M0), F1/F3/F4 (M3/M4), F7/H7 (M7), L5/U5 (M33), H5 (M33), U0 (M0+).

Vendor HAL + LL layers on top of CMSIS.

NXP

LPC5500 (M33+M33), i.MX RT (M7 @ 1 GHz "crossover MCU"), Kinetis (M0+/M4), S32K (automotive M4/M7).

Secure wake-up M33 often paired with a bigger A-profile.

Nordic · Silicon Labs · Renesas

nRF52 (M4), nRF53 (dual M33), nRF54 (M33); EFR32 wireless (M33); Renesas RA (M33/M85).

Strong wireless-stack + TrustZone story.

Microchip · Infineon · TI

SAM family (M0+/M4/M7/M23); PSoC 6 (M4+M0+); TM4C / MSP432 (M4/M4F).

Ambiq · Alif

Ambiq Apollo (M4F/M55, subthreshold, < 6 µA/MHz). Alif Ensemble (M55 + M55 + NPU).

Edge-ML showpieces.

Raspberry Pi

RP2040 (dual M0+), RP2350 (dual M33 or dual Hazard3 RISC-V; first official multi-arch MCU).

CMSIS — The Portable Layer

CMSIS-CORE — C headers for registers, intrinsics (__WFI, __DMB), NVIC/SCB/MPU inline functions. Every Cortex-M vendor ships this.
CMSIS-DSP — fixed/float DSP kernels; uses DSP & FPU instructions when available, scalar fallback when not.
CMSIS-NN — int8/int16 NN primitives optimised for M4 DSP and M55/M85 Helium.
CMSIS-RTOS v2 — thin API spec; FreeRTOS/Zephyr/RTX implement it.
CMSIS-Pack — device description files, used by Keil µVision, VS Code embedded tools, Arm Clang.

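/* Portable across Cortex-M; __LDREXW/__CLREX need exclusive-access support (not v6-M) */
#include <stdint.h>
#include "cmsis_compiler.h"   /* normally pulled in via the device header */

void enter_critical(void)
{
    __disable_irq();          /* cpsid i; not nestable, caller re-enables */
    __DMB();
}

uint32_t atomic_load(volatile uint32_t *p)
{
    uint32_t v = __LDREXW(p);
    __CLREX();                /* drop monitor */
    return v;
}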

A huge reason Cortex-M dominates: driver code written in 2009 against CMSIS-CORE still compiles and runs on a 2024 Cortex-M85.

Performance Snapshot

Core	Typical f_max	CoreMark/MHz	DMIPS/MHz	Approx. gate count (65/40/22 nm implementations)
M0	~50 MHz	2.33	0.9	< 12k gates
M0+	~50 MHz	2.46	0.95	< 12k gates
M3	~100 MHz	3.32	1.25	~30k gates
M4	~120 MHz	3.40	1.25	~35k (no FPU) / ~55k (with FPU)
M7	600 MHz	5.01	2.14	~300k gates (+ caches)
M33	~160 MHz	4.02	1.5	~45k gates
M55	~400 MHz	4.35	1.7	~300k (with Helium)
M85	~700 MHz	6.28	4.0	> 600k gates (with options)

Numbers from Arm's published figures (2024/2025). Actual silicon depends heavily on library, process, and whether the vendor pushed the implementation for f_max or for area.

Choosing a Cortex-M — Decision Flow

References & Further Reading

Arm — Armv6-M / Armv7-M / Armv8-M Architecture Reference Manuals (ARM DDI 0419, 0403, 0553)
Arm — Cortex-M0/M0+/M3/M4/M7/M23/M33/M55/M85 Technical Reference Manuals
Joseph Yiu — The Definitive Guide to Arm Cortex-M23 & Cortex-M33 Processors (Newnes, 2021)
Joseph Yiu — The Definitive Guide to Arm Cortex-M3 and Cortex-M4 Processors, 3rd ed. (Newnes, 2014)
Jonathan Valvano — Embedded Systems: Real-Time Interfacing to ARM Cortex-M Microcontrollers (2017)
Arm Developer — developer.arm.com/documentation — free TRMs, QRCs, CMSIS headers
Arm Community — CMSIS-Core on GitHub (github.com/ARM-software/CMSIS_5)
Wikipedia — "ARM Cortex-M" family article — a surprisingly good cross-reference table

Presentation built with Reveal.js 4.6 · Playfair Display + DM Sans + JetBrains Mono
Educational use. Code examples provided as-is.