The default 32-bit MCU ISA. Virtually every major MCU vendor (ST, NXP, Nordic, Silicon Labs, Renesas, Microchip, Infineon, TI, Ambiq, GigaDevice, Raspberry Pi) ships Cortex-M silicon.
Volume. Arm-based microcontrollers ship at tens of billions of units per year; Cortex-M is the largest slice.
Span. One family from a $0.10 Cortex-M0 to a 600 MHz cache-equipped Cortex-M85 — same toolchain, same CMSIS headers.
Ubiquity. Also lives inside bigger chips: as a secure enclave (Cortex-M33), a wake-up controller, or a peripheral offload engine in many A-profile SoCs.
Interview context
Cortex-M is the one architecture an embedded-firmware engineer is almost guaranteed to be tested on. Expect questions on the exception model, memory map, NVIC, MPU, and increasingly TrustZone.
"The M-profile is the only 32-bit architecture designed from the ground up for deterministic interrupt latency and bit-level peripheral control — everything else is a general-purpose CPU with interrupts bolted on."
— Paraphrasing Joseph Yiu's design philosophy
Arm Profiles: A, R, M
| Profile | Target | Memory | OS | Examples |
| --- | --- | --- | --- | --- |
| A (Application) | Smartphones, servers, laptops | MMU with virtual memory (VMSA, multi-level page tables) | Linux, Android, Windows, iOS | Cortex-A78, Neoverse N2, Apple M-series |
| R (Real-Time) | Automotive, baseband, storage | MPU (protection, no virtual memory) | AUTOSAR, QNX, bare-metal | Cortex-R52, R82 |
| M (Microcontroller) | Deeply embedded, IoT, sensors | Optional MPU · physical addresses only | FreeRTOS, Zephyr, bare-metal | Cortex-M0+, M4, M33, M85 |
What makes M-profile different
Thumb-only — no A32/A64 mode; the core starts in Thumb state and stays there.
NVIC is architectural, not an SoC add-on. Every Cortex-M has one, and CMSIS-CORE exposes it through the same API on every device.
Exception entry in hardware: automatic stacking of the caller-saved register set.
No virtual memory. Addresses go directly to the bus.
What M-profile does not have
No A64 (AArch64) instructions — 32-bit only.
No virtualisation extensions (EL2).
No SMP coherency protocol. Multi-core Cortex-M is asymmetric.
Caches are optional and only appear on the larger cores: M7, M35P (instruction cache only), M52, M55 and M85.
The Cortex-M Family Timeline
Dates indicate initial announcement. v8.1-M (Helium/MVE) layered on top of v8-M Mainline adds the vector extension used by M52 · M55 · M85.
Choose a Core — Interactive
Each core is a distinct IP block in the same family, with vastly different trade-offs.
Headline numbers are upper-bound estimates from Arm's own benchmarking. Real silicon depends on memory latency, wait states, and which options the SoC vendor licensed.
Armv6-M — The Minimal M-Profile
Instruction set
Thumb subset: ~56 instructions.
No integer divide — software emulation.
No bit-field ops, no saturating arithmetic.
32×32→32 multiply (1-cycle or 32-cycle by option).
No IT block (IT arrives with Thumb-2 in v7-M). Almost every encoding is 16-bit; only BL, MRS, MSR, DSB, DMB and ISB are 32-bit encodings.
System features
No bit-banding.
No BASEPRI — only PRIMASK for IRQ masking.
Max 32 external IRQs, 4 priority levels.
MPU optional on M0+ (8 regions) — not on M0/M1.
Single vector table; no VTOR on M0 (relocation via remap-to-SRAM trick on some SoCs); optional VTOR on M0+.
Why it still matters: v6-M is cheap enough to drop in as a wake-up/boot controller on top of something far bigger. It is also the ISA targeted by dozens of Arm-compatible open clones.
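Because v6-M has no SDIV/UDIV, every division ends up in a runtime helper supplied by the toolchain. The sketch below shows the kind of shift-and-subtract loop such a helper can be built on. The name `soft_udiv32` and the divide-by-zero policy are assumptions for illustration; real library routines are hand-tuned and considerably faster.

```c
#include <stdint.h>

/* Hypothetical helper: restoring (shift-and-subtract) unsigned division.
 * Returns the quotient and, if rem is non-NULL, stores the remainder.
 * Division by zero returns 0 with remainder n (placeholder policy only). */
uint32_t soft_udiv32(uint32_t n, uint32_t d, uint32_t *rem)
{
    uint32_t q = 0;
    uint32_t r = 0;

    if (d == 0u) {
        if (rem) { *rem = n; }
        return 0u;
    }
    for (int i = 31; i >= 0; --i) {
        /* Shift the next dividend bit into the running remainder.
         * r < 2^31 before every shift, so this cannot overflow. */
        r = (r << 1) | ((n >> i) & 1u);
        if (r >= d) {
            r -= d;
            q |= (1u << i);
        }
    }
    if (rem) { *rem = r; }
    return q;
}
```

One iteration per quotient bit is the reason a software divide costs tens of cycles on M0/M0+, against the 2–12 cycles quoted for hardware divide on M3.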
Armv7-M / v7E-M — Full Thumb-2
v7-M (M3)
Full Thumb-2 (~200 instructions).
Hardware divide (SDIV/UDIV, 2–12 cycles).
Bit-field: BFI, BFC, UBFX, SBFX.
Exclusive access: LDREX/STREX (word, half, byte).
Bit-banding, NVIC, full MPU option.
Up to 240 IRQs, 8-bit priority field (implementations expose 3–8 bits).
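The "3–8 implemented bits" point is easier to see in code: the implemented bits occupy the most-significant end of each 8-bit priority field, so a logical priority has to be shifted left before it is written. A minimal sketch, where `PRIO_BITS` stands in for the `__NVIC_PRIO_BITS` macro a device header would normally provide and `encode_priority` is a hypothetical helper:

```c
#include <stdint.h>

/* Assumption: this device implements 4 priority bits. Device headers
 * normally supply the real value as __NVIC_PRIO_BITS. */
#define PRIO_BITS 4u

/* Place a logical priority (0 = most urgent) into the top PRIO_BITS of the
 * 8-bit field. The unimplemented low bits stay zero. */
static inline uint8_t encode_priority(uint32_t logical)
{
    uint32_t max = (1u << PRIO_BITS) - 1u;
    if (logical > max) {
        logical = max;               /* clamp to the least urgent level */
    }
    return (uint8_t)(logical << (8u - PRIO_BITS));
}
```

With four bits implemented, logical priority 5 becomes the register value 0x50. Writing 0x05 directly would leave only unimplemented bits set and behave as priority 0.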
v7E-M adds (M4, M7)
DSP extension: SSAT16/USAT16, QADD/QSUB, QADD8/QADD16, SMLAD, SMMUL etc. (plain SSAT/USAT and the long multiplies such as SMLAL are already in v7-M).
Packed-operand SIMD on 32-bit registers (2×16-bit or 4×8-bit lanes).
Optional FPU (FPv4-SP single precision on M4; FPv5 single precision or single + double precision on M7).
The DSP extension alone is a major interview topic for any DSP-adjacent role — see presentation 04.
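As an illustration of "packed-operand SIMD on 32-bit registers", here is a host-side C model of what QADD16 computes: two independent signed 16-bit saturating additions in one register. The function names are hypothetical. On a v7E-M target the same operation is a single instruction, available in CMSIS as the `__QADD16` intrinsic.

```c
#include <stdint.h>

/* Clamp a 32-bit intermediate to the signed 16-bit range. */
static int16_t sat16(int32_t x)
{
    if (x > INT16_MAX) { return INT16_MAX; }
    if (x < INT16_MIN) { return INT16_MIN; }
    return (int16_t)x;
}

/* Model of QADD16: lane-wise signed saturating add on 2 x 16-bit lanes. */
uint32_t model_qadd16(uint32_t a, uint32_t b)
{
    int32_t lo = (int16_t)(a & 0xFFFFu) + (int16_t)(b & 0xFFFFu);
    int32_t hi = (int16_t)(a >> 16)     + (int16_t)(b >> 16);

    return ((uint32_t)(uint16_t)sat16(hi) << 16) | (uint16_t)sat16(lo);
}
```

Each lane saturates on its own, so an overflow in the upper lane does not disturb the lower one. That is what makes the packed form useful for audio and fixed-point filter kernels.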
Armv8-M — TrustZone Arrives
v8-M Baseline (M23)
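Armv6-M + TrustZone (the "hardware security" part), plus hardware divide, CBZ/CBNZ, MOVW/MOVT and exclusive accesses.
Co-processor interface for vendor-specific accelerators (a Mainline feature, as on M33; the M23 does not have one).
Separate S/NS banks for MSP, PSP, the stack-limit registers, CONTROL, PRIMASK, and (on Mainline) FAULTMASK and BASEPRI.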
Armv8.1-M (M52/M55/M85): layered on top of v8-M Mainline. Adds Helium (MVE), low-overhead loops (the Low Overhead Branch extension, LOB), a custom-instruction framework, and optional PACBTI. See presentation 04.
Thumb-2 — Mixed 16/32-bit Encoding
The idea
Cortex-M runs only in Thumb state — no A32 (ARM state), no A64.
Most common instructions encode in 16 bits → excellent code density.
Less-common or wider-immediate variants use a 32-bit encoding, distinguishable by the top-5 bits of the half-word.
Result: performance close to 32-bit ARM code at roughly 1/1.3 of its code size (about 23% smaller).
Code density matters more than clock speed when your flash is 64 KB and costs $0.02/KB.
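The "top five bits" rule above can be written down directly. This sketch classifies the first halfword of a Thumb instruction as the start of a 16-bit or a 32-bit encoding. The function name is hypothetical, but the three bit patterns are the ones the architecture reserves for 32-bit encodings.

```c
#include <stdint.h>
#include <stdbool.h>

/* True when hw1 is the first halfword of a 32-bit Thumb encoding, i.e. its
 * top five bits are 0b11101, 0b11110 or 0b11111. Anything else is a
 * complete 16-bit instruction. */
bool thumb_is_32bit(uint16_t hw1)
{
    uint16_t top5 = (uint16_t)(hw1 >> 11);
    return top5 == 0x1Du || top5 == 0x1Eu || top5 == 0x1Fu;
}
```

For example, 0x4770 (BX LR) is a 16-bit instruction, while 0xF000 is the first half of a BL and needs the following halfword to decode.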
R14 (LR)
Link register. On exception entry it holds the EXC_RETURN magic value.
R15 (PC)
Program counter. Always halfword-aligned (bit 0 reads as 0); the Thumb bit lives in bit 0 of branch targets and vector-table entries, and in EPSR.T.
AAPCS calling convention
Args in R0–R3, then stack.
Return value in R0 (or R0:R1 for 64-bit).
R4–R11 (and SP) preserved across calls; LR carries the return address, so a non-leaf callee saves it before making further calls.
Stack 8-byte aligned at public interfaces.
The auto-stacked set {R0-R3, R12, LR, PC, xPSR} = exactly the caller-saved set + return state. Hardware is effectively doing the function-prologue save for the ISR.
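The auto-stacked frame maps naturally onto a C struct, which is how fault handlers usually inspect it. A minimal sketch of the basic (no-FPU) frame, lowest address first; the type and helper names are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

/* Basic exception frame as the hardware pushes it, lowest address first.
 * The stack pointer in use at exception entry points at r0 afterwards. */
typedef struct {
    uint32_t r0;
    uint32_t r1;
    uint32_t r2;
    uint32_t r3;
    uint32_t r12;
    uint32_t lr;     /* LR of the interrupted code, not EXC_RETURN */
    uint32_t pc;     /* address execution resumes at */
    uint32_t xpsr;   /* flags, Thumb bit, exception number, padding flag */
} exc_frame_t;

/* Hypothetical fault-handler helper: where was the CPU when it was interrupted? */
static inline uint32_t interrupted_pc(const exc_frame_t *frame)
{
    return frame->pc;
}
```

A fault handler would typically read MSP or PSP (selected by EXC_RETURN bit 2), cast it to `const exc_frame_t *`, and log `pc` and `lr`.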
xPSR — The Program Status Register
Three logical views of one physical register: APSR (flags), EPSR (execution state), IPSR (active exception number). Separate MRS/MSR instructions read/write each subset. On exception entry, the whole xPSR is pushed as the 8th auto-stacked word.
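To make the three views concrete, the sketch below splits a stacked xPSR word into its fields. The bit positions are the architectural ones (flags in [31:27], T in bit 24, exception number in [8:0], stack-padding flag in bit 9 of the stacked copy). The struct and function names are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    bool n, z, c, v, q;     /* APSR condition flags (Q needs v7-M or later) */
    bool thumb;             /* EPSR.T, must be 1 on Cortex-M */
    bool stack_padded;      /* bit 9 of the stacked copy: alignment word inserted */
    uint32_t exception;     /* IPSR: 0 = Thread mode, otherwise exception number */
} xpsr_fields_t;

xpsr_fields_t decode_stacked_xpsr(uint32_t x)
{
    xpsr_fields_t f;
    f.n = (x >> 31) & 1u;
    f.z = (x >> 30) & 1u;
    f.c = (x >> 29) & 1u;
    f.v = (x >> 28) & 1u;
    f.q = (x >> 27) & 1u;
    f.thumb = (x >> 24) & 1u;
    f.stack_padded = (x >> 9) & 1u;
    f.exception = x & 0x1FFu;
    return f;
}
```

A stacked value with `thumb` clear is a classic sign that the interrupted code branched to an even address, which is an INVSTATE UsageFault waiting to happen on return.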
Special Registers
| Reg | Width | Purpose | Access |
| --- | --- | --- | --- |
| PRIMASK | 1 bit | Master IRQ disable. cpsid i sets; cpsie i clears. Blocks all exceptions except NMI and HardFault. | Privileged |
| FAULTMASK | 1 bit | Like PRIMASK but also blocks HardFault. Auto-cleared on return from handler. | Privileged · not on v6-M |
| BASEPRI | 8 bit | Block all IRQs of numerical priority ≥ BASEPRI (lower number = higher priority); 0 disables masking. Finer-grained than PRIMASK. | Privileged · not on v6-M |
| CONTROL | 3 bit | Bit 0 (nPRIV): Thread mode privilege; 1 = unprivileged. Bit 1 (SPSEL): stack in use; 0 = MSP, 1 = PSP (in Thread mode). Bit 2 (FPCA): FPU context active; controls lazy stacking. | Privileged (writing) |
| MSPLIM / PSPLIM | 32 bit | Stack-pointer lower limit. Hardware UsageFault on stack descent past limit. | Privileged · v8-M Mainline only |
Interview tell: a candidate who instinctively reaches for BASEPRI (not PRIMASK) to build a critical section shows they understand priority preemption: low-priority FreeRTOS kernel locks don't need to block a high-priority motor-control ISR. (NMI is not blocked by either register.)
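The difference between the two masks can be captured in a few lines. This is a host-side model only, and it deliberately ignores the priority of whatever is currently executing; `irq_passes_masks` is a hypothetical name. Priorities are the raw 8-bit register values, so a lower number is more urgent.

```c
#include <stdint.h>
#include <stdbool.h>

/* Model: would an IRQ of the given priority get past PRIMASK and BASEPRI?
 * (Preemption against the currently active exception is not modelled.) */
bool irq_passes_masks(uint8_t irq_prio, uint8_t basepri, bool primask)
{
    if (primask) {
        return false;                 /* blocks every configurable-priority IRQ */
    }
    if (basepri != 0u && irq_prio >= basepri) {
        return false;                 /* at or below the BASEPRI urgency threshold */
    }
    return true;                      /* BASEPRI == 0 means "no masking" */
}
```

With BASEPRI set to 0x40, a motor-control ISR at priority 0x20 still runs while the kernel's critical section keeps everything at 0x40 and lower urgency out. With PRIMASK set, both are held off.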
Operating State Matrix
| | Privileged | Unprivileged |
| --- | --- | --- |
| Thread mode | Privileged Thread: boot state · kernel code · RTOS idle task · full SCS access | Unprivileged Thread: user tasks · no SCB/NVIC writes · MPU enforces memory isolation |
| Handler mode | Handler (always privileged): every exception / interrupt runs here · always on MSP · IPSR exception number ≠ 0 | n/a |
Handler returns by loading EXC_RETURN value into PC (via BX LR, POP {PC}, etc.).
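Magic bits in EXC_RETURN choose Handler vs Thread mode and MSP vs PSP on return. CONTROL.nPRIV is not part of the stacked state; it is left untouched during the handler and applies again once execution is back in Thread mode.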
Dual Stacks — MSP & PSP
Why two stacks?
MPU can restrict user-task stacks to their own PSP region without blocking the kernel's MSP.
A rogue task that blows its stack cannot corrupt kernel state above: at worst it takes a UsageFault (STKOF) at its own PSPLIM on v8-M, or a MemManage fault from an MPU guard region.
Exception always stacks on the currently-active SP, then switches to MSP for the handler. On return, CPU restores the previous SP selection from EXC_RETURN.
Hardware pushes exactly the caller-saved set. The handler may compile as an ordinary C function with no prologue tricks.
Stack is aligned to 8 bytes (AAPCS) — CPU inserts padding and flags it in bit 9 of the stacked xPSR.
If FPU is enabled and context has been touched (CONTROL.FPCA=1), an extended frame of 26 words is used.
Lazy stacking (v7E-M, v8-M): space is reserved but S0-S15/FPSCR are not actually written until the handler itself executes an FP instruction. Typical 17-cycle saving.
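The frame-size and alignment rules above can be modelled on a host. This sketch assumes 8-byte stack alignment is enabled (CCR.STKALIGN = 1), and the names are hypothetical. It returns where the frame starts, how big it is, and whether the padding word recorded in bit 9 of the stacked xPSR was inserted.

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint32_t frame_ptr;     /* SP value after stacking (address of stacked R0) */
    uint32_t frame_bytes;   /* 32 for the basic frame, 104 for the extended one */
    bool     padded;        /* becomes bit 9 of the stacked xPSR */
} stacking_result_t;

/* Model of exception-entry stacking with CCR.STKALIGN = 1 (assumption). */
stacking_result_t model_exception_stacking(uint32_t sp, bool fp_context_active)
{
    stacking_result_t r;
    r.frame_bytes = fp_context_active ? 26u * 4u : 8u * 4u;
    r.padded      = (sp & 4u) != 0u;              /* SP was only 4-byte aligned */
    r.frame_ptr   = (sp - r.frame_bytes) & ~4u;   /* force 8-byte alignment */
    return r;
}
```

Both frame sizes are multiples of 8, so padding is needed exactly when the interrupted code had SP at an address that is 4 modulo 8.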
EXC_RETURN — The Magic LR
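On exception entry, LR is loaded with a value whose upper bits are all ones: bits [31:4] = 0xFFFFFFF on cores without an FPU, while on FPU cores bit 4 is cleared when an extended frame was stacked. Bits [3:0] encode how to return: 0x1 = Handler mode on MSP, 0x9 = Thread mode on MSP, 0xD = Thread mode on PSP.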
Executing BX lr with any of these forces an exception-return sequence: unstack the frame, restore xPSR / IPSR, resume.
v8-M extensions
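Extra bit [6] (S) records whether the frame sits on a Secure or a Non-secure stack.
Extra bit [0] (ES) records the security state the exception was taken to, and bit [5] (DCRS) records whether the default callee-register stacking rules applied, i.e. whether the integrity signature and additional context were stacked during a cross-domain exception.
An invalid or inconsistent EXC_RETURN in Handler mode raises a UsageFault (INVPC). In Thread mode the value is not treated as an exception return at all: it is an ordinary branch into execute-never address space, which faults on the instruction fetch.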
Common bug: a handler written in pure assembler clobbers LR and forgets to reload EXC_RETURN before BX lr. Result: random jump, HardFault, or worse.
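A small decoder makes the EXC_RETURN encoding easier to remember. The sketch covers only the bits common to v7-M and v8-M (mode, stack selection, frame type) and checks only the top byte as the "magic" prefix. The names are hypothetical, and the v8-M security bits are left out.

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    bool has_prefix;    /* top byte is 0xFF, so the value looks like EXC_RETURN */
    bool to_thread;     /* bit 3: 1 = return to Thread mode, 0 = Handler mode */
    bool use_psp;       /* bit 2: 1 = unstack from PSP, 0 = from MSP */
    bool basic_frame;   /* bit 4: 1 = 8-word frame, 0 = extended FP frame */
} exc_return_t;

exc_return_t decode_exc_return(uint32_t lr)
{
    exc_return_t r;
    r.has_prefix  = (lr & 0xFF000000u) == 0xFF000000u;
    r.to_thread   = (lr & (1u << 3)) != 0u;
    r.use_psp     = (lr & (1u << 2)) != 0u;
    r.basic_frame = (lr & (1u << 4)) != 0u;
    return r;
}
```

Decoding 0xFFFFFFFD gives Thread mode, PSP, basic frame: the value an RTOS task sees when it is interrupted. 0xFFFFFFE9 is Thread mode on MSP with an extended FP frame.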
Pipeline Comparison
| Core | Stages | Pipeline | Branch prediction | Issue width |
| --- | --- | --- | --- | --- |
| M0 / M3 / M4 | 3 | Fetch → Decode → Execute | Static (branch speculation on M3/M4 only) | 1 (in-order) |
| M0+ | 2 | Fetch/Decode → Execute | None (single-cycle branch target) | 1 |
| M23 | 2 | Fetch → Decode/Execute | Static | 1 |
| M33 / M52 / M55 | 3 – 4 | F → D → E (→ WB on M55) | Static | 1 |
| M7 | 6 | F1 F2 D1 D2 EX WB | Dynamic (BHT, BTB) | Dual-issue (in-order) |
| M85 | 7 | F1 F2 D1 D2 I EX WB | Dynamic (BHT, BTB, RAS) | Dual-issue (in-order) |
Even the deepest Cortex-M pipelines are in-order. There is no register renaming, no out-of-order execution, no speculation past branches without rewinding. This is a deliberate design choice: deterministic WCET matters more than peak IPC for the target workloads.
Endianness & Alignment
Endianness
Cortex-M supports either little- or big-endian — chosen at reset by a strap pin (BIGEND).
Virtually every shipping silicon implementation is little-endian. Big-endian exists in the architecture but you are unlikely to meet it in the wild.
Data endianness only: instruction fetches are always little-endian.
REV, REV16, REVSH, RBIT instructions for byte/bit swaps.
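For reference, here is what the three byte-swap instructions compute, written as a host-side C model. The `model_*` names are hypothetical. On target, CMSIS-CORE exposes the real instructions as `__REV`, `__REV16` and `__REVSH`.

```c
#include <stdint.h>

/* REV: reverse all four bytes (32-bit endianness swap). */
uint32_t model_rev(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000FF00u) |
           ((x << 8) & 0x00FF0000u) | (x << 24);
}

/* REV16: swap the two bytes inside each halfword independently. */
uint32_t model_rev16(uint32_t x)
{
    return ((x >> 8) & 0x00FF00FFu) | ((x << 8) & 0xFF00FF00u);
}

/* REVSH: swap the bytes of the low halfword, then sign-extend to 32 bits. */
int32_t model_revsh(uint32_t x)
{
    uint16_t h = (uint16_t)(((x & 0xFFu) << 8) | ((x >> 8) & 0xFFu));
    return (int32_t)(int16_t)h;
}
```

REV is the one used for network byte order on a little-endian core. REV16 suits arrays of big-endian 16-bit samples, processed two at a time.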
Alignment
v7-M / v7E-M / v8-M: unaligned word & half-word loads/stores supported in Normal memory (split into aligned beats, several cycles).
v6-M: unaligned access always faults (HardFault). On M3/M4 an unaligned access raises a UsageFault only if CCR.UNALIGN_TRP = 1.
Single-core architectural primitive — maps to a local monitor in each Cortex-M.
Any exception, context switch, or write to the tagged address clears the monitor → retry.
Used by FreeRTOS (on M3+) and C11 atomics.
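Byte and halfword variants (LDREXB/STREXB, LDREXH/STREXH) have been available since v7-M; v8-M adds load-acquire/store-release forms (LDA/STL, LDAEX/STLEX) and clearer ordering rules.
M0 / M0+ (v6-M) lack LDREX/STREX and use cpsid i / PRIMASK critical sections instead; M23 (v8-M Baseline) does include the exclusive instructions.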
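The load-exclusive / store-exclusive retry pattern is what a C11 compare-and-swap loop expresses portably. The sketch below uses only `<stdatomic.h>`. That a v7-M compiler lowers it to an LDREX/STREX loop is the usual outcome, but it is a toolchain detail, not something the C standard promises. The function name is hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Atomically add delta to *p and return the previous value.
 * The loop retries whenever another writer (or, on Cortex-M, an exception
 * that cleared the local monitor) got in between the load and the store. */
uint32_t atomic_add_retry(_Atomic uint32_t *p, uint32_t delta)
{
    uint32_t old = atomic_load_explicit(p, memory_order_relaxed);

    while (!atomic_compare_exchange_weak_explicit(
               p, &old, old + delta,
               memory_order_acq_rel, memory_order_relaxed)) {
        /* On failure 'old' now holds the current value; just try again. */
    }
    return old;
}
```

The weak compare-exchange is allowed to fail spuriously, which matches STREX semantics: a cleared monitor makes the store fail even if the value in memory never changed.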
Bus Interfaces
| Core | Bus | Notes |
| --- | --- | --- |
| M0 / M0+ / M1 | AHB-Lite | Single 32-bit master bus; separate PPB for CoreSight/NVIC |
| M3 / M4 | AHB-Lite (I-Code / D-Code / System bus + PPB) | Code-space traffic on I-Code & D-Code; everything else on System bus → SoC can arbitrate independently |
| M7 | AXI master + AHB peripheral port + ITCM/DTCM | AXI for high-bandwidth external memory; TCM for deterministic code/data |
| M23 / M33 / M35P | AHB5 (TrustZone-aware) | Security attributes (HNONSEC) on every transaction |
| M52 / M55 / M85 | AHB5 + optional AXI + TCM + Co-processor (ACI) bus | ACI exposes dedicated instruction opcodes to an attached accelerator |
Harvard-style split (I-bus / D-bus) lets the CPU fetch instructions and operate on data in parallel — critical for hitting the 1 DMIPS/MHz target on a 3-stage pipeline. The bus matrix in the SoC ultimately collapses them onto a unified memory, but the core sees separate paths.
Reset Behaviour
What the CPU does out of reset
Read 32-bit word at address 0x00000000 → load into MSP.
Read 32-bit word at address 0x00000004 → load into PC (reset vector).
Enter Privileged Thread mode, MSP selected, T-bit = 1.
All NVIC IRQs disabled, VTOR = 0, SCB/MPU unconfigured.
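The first two reset steps amount to reading two words from the start of the image. A host-side sketch that performs the same reads on a buffer and checks the Thumb bit of the reset vector; the names are hypothetical and the addresses in the usage note are made-up example values.

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint32_t initial_msp;   /* word 0 of the vector table */
    uint32_t reset_pc;      /* word 1 with the Thumb bit stripped */
    bool     thumb_ok;      /* word 1 must have bit 0 set */
} reset_state_t;

/* Model of what the core fetches from the vector table at reset. */
reset_state_t model_reset(const uint32_t *vector_table)
{
    reset_state_t s;
    uint32_t reset_vector = vector_table[1];

    s.initial_msp = vector_table[0];
    s.thumb_ok    = (reset_vector & 1u) != 0u;
    s.reset_pc    = reset_vector & ~1u;
    return s;
}
```

For a table starting `{0x20002000, 0x08000101}` the model reports MSP = 0x20002000 and execution starting at 0x08000100. An even reset vector would leave T = 0, and the first instruction would fault.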
RP2040 (dual M0+), RP2350 (dual M33 or dual Hazard3 RISC-V, selectable at boot; first official multi-arch MCU).
CMSIS — The Portable Layer
CMSIS-CORE — C headers for registers, intrinsics (__WFI, __DMB), NVIC/SCB/MPU inline functions. Every Cortex-M vendor ships this.
CMSIS-DSP — fixed/float DSP kernels; uses DSP & FPU instructions when available, scalar fallback when not.
CMSIS-NN — int8/int16 NN primitives optimised for M4 DSP and M55/M85 Helium.
CMSIS-RTOS v2 — thin API spec; FreeRTOS/Zephyr/RTX implement it.
CMSIS-Pack — device description files, used by Keil µVision, VS Code embedded tools, Arm Clang.
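/* Portable across Cortex-M; __LDREXW/__CLREX need exclusives (not on M0/M0+) */
#include <stdint.h>
#include "cmsis_compiler.h"   /* normally pulled in via the device header */

/* Simple, non-nesting critical section entry */
void enter_critical(void)
{
    __disable_irq();          /* cpsid i: sets PRIMASK */
    __DMB();                  /* order earlier memory accesses before the section */
}

uint32_t atomic_load(volatile uint32_t *p)
{
    uint32_t v = __LDREXW(p); /* exclusive load sets the local monitor */
    __CLREX();                /* drop monitor */
    return v;
}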
A huge reason Cortex-M dominates: driver code written in 2009 against CMSIS-CORE still compiles and runs on a 2024 Cortex-M85.
Performance Snapshot
| Core | Typical fmax | CoreMark/MHz | DMIPS/MHz | Typical silicon (65/40/22 nm), approx. gate count |
| --- | --- | --- | --- | --- |
| M0 | ~50 MHz | 2.33 | 0.9 | < 12k gates |
| M0+ | ~50 MHz | 2.46 | 0.95 | < 12k gates |
| M3 | ~100 MHz | 3.32 | 1.25 | ~30k gates |
| M4 | ~120 MHz | 3.40 | 1.25 | ~35k (no FPU) / ~55k (with FPU) |
| M7 | 600 MHz | 5.01 | 2.14 | ~300k gates (+ caches) |
| M33 | ~160 MHz | 4.02 | 1.5 | ~45k gates |
| M55 | ~400 MHz | 4.35 | 1.7 | ~300k (with Helium) |
| M85 | ~700 MHz | 6.28 | 3.13 | > 600k gates (with options) |
Numbers from Arm's published figures (2024/2025). Actual silicon depends heavily on cell library, process, and whether the vendor optimised the implementation for fmax or for area.
Choosing a Cortex-M — Decision Flow
References & Further Reading
Arm: Armv6-M / Armv7-M / Armv8-M Architecture Reference Manuals (ARM DDI 0419, 0403, 0553)
Arm: Cortex-M0/M0+/M3/M4/M7/M23/M33/M55/M85 Technical Reference Manuals
Joseph Yiu: The Definitive Guide to Arm Cortex-M23 & Cortex-M33 Processors (Newnes, 2021)
Joseph Yiu: The Definitive Guide to Arm Cortex-M3 and Cortex-M4 Processors, 3rd ed. (Newnes, 2014)
Jonathan Valvano: Embedded Systems: Real-Time Interfacing to ARM Cortex-M Microcontrollers (2017)
Arm Developer: developer.arm.com/documentation (free TRMs, QRCs, CMSIS headers)
Arm Community: CMSIS-Core on GitHub (github.com/ARM-software/CMSIS_5)
Wikipedia: "ARM Cortex-M" family article (a surprisingly good cross-reference table)
Educational use. Code examples provided as-is.