Every Cortex-M has the same debug & trace IP — CoreSight — so the same probe drives an STM32, Nordic nRF52, and NXP i.MX RT.
The debug system supports both intrusive debug (halt, single-step, read memory) and non-intrusive debug (trace, watchpoints with hit counts, printf via ITM).
On M0/M0+/M23 the footprint is minimal (at most 4 hardware breakpoints and 2 watchpoints on M0/M0+, typically no ETM); on M7/M55/M85 it is a full-strength trace infrastructure.
A debugger finds 80% of bugs; a tracer finds the remaining 20% — timing, race conditions, ISR interactions.
Interview angle
A strong candidate can sketch the CoreSight block diagram on the whiteboard, explain what the DAP is, and describe how printf() makes it off-chip via ITM+SWO without stopping the CPU.
CoreSight Block Diagram
SWD vs JTAG
| | JTAG (4/5 wires) | SWD (2 wires) |
|---|---|---|
| Pins | TCK, TMS, TDI, TDO (+nTRST) | SWCLK, SWDIO |
| Bandwidth | ~25 Mbps typ. | ~50 Mbps typ. (half-duplex) |
| Daisy-chaining | IEEE 1149.1 scan chain | Multi-drop SWD (ARM IHI 0031) |
| Boundary scan | Yes | No |
| Typical MCU pinout | Optional (shares with SWJ-DP if SWJ mode) | Default on nearly every Cortex-M |
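SWJ-DP is a dual-role block that speaks either JTAG or SWD; to switch from JTAG to SWD the probe clocks out a 16-bit sequence on TMS/SWDIO (0x79E7 MSB-first, which is 0xE79E sent LSB-first), bracketed by line resets of at least 50 clocks with the pin held high.
SWO (Serial Wire Output) is a one-way, single-pin output that emits ITM packets via the TPIU, in either UART (NRZ) or Manchester encoding, at up to ~200 Mbps; the usable rate is normally limited by the trace clock and the probe.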
Why SWD won: MCUs have few pins. Two-wire SWD gives equivalent debug without the TDI/TDO cost. Every vendor defaults to SWD today, with JTAG optional.
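The SWO bit rate in UART mode is derived from the trace clock by a prescaler in the TPIU. A minimal sketch of that arithmetic, assuming the usual relation swo_baud = trace_clk / (ACPR + 1); the helper names are hypothetical and the clock values in the usage note are only examples:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: prescaler value (TPIU ACPR) for a requested SWO rate.
 * Assumes swo_baud = trace_clk / (ACPR + 1). */
static uint32_t swo_prescaler(uint32_t trace_clk_hz, uint32_t swo_baud_hz)
{
    assert(swo_baud_hz != 0 && swo_baud_hz <= trace_clk_hz);
    /* Round to the nearest divider so the real rate lands close to the request. */
    uint32_t div = (trace_clk_hz + swo_baud_hz / 2u) / swo_baud_hz;
    return div - 1u;
}

/* Bit rate actually produced by a given prescaler value. */
static uint32_t swo_actual_baud(uint32_t trace_clk_hz, uint32_t acpr)
{
    return trace_clk_hz / (acpr + 1u);
}
```

For example, a 168 MHz trace clock and a 2 Mbps target give a prescaler of 83. The probe must be told the same rate, so it is worth checking swo_actual_baud() when the division is not exact.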
The Debug Access Port (DAP)
Small bus master inside the SoC; accepts probe commands and performs memory/register reads/writes without halting the CPU.
AP — Access Port. One AHB-AP per bus master (typically: system bus + PPB). APB-AP for CoreSight components.
Register model is memory-mapped over the probe transport: on SWD each transfer is an 8-bit request header, a 3-bit ACK from the target, and 32 data bits plus one parity bit, with turnaround cycles where the line changes direction.
/* Pseudo: probe interaction */
DP:SELECT = AP 0, bank 0 ; pick AHB-AP
AP:CSW = 0x23000012 ; 32-bit word (Size=0b010), auto-increment (AddrInc=0b01)
AP:TAR = 0x20000000 ; target address
AP:DRW = read/write ; data read or write
; TAR auto-increments by 4 bytes after each DRW access, so burst reads follow
Probe firmware and host tools such as CMSIS-DAP and pyOCD expose this at a higher level, for example to dump 1 KiB of SRAM without halting.
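To make the SWD packet layout concrete, here is a minimal sketch that builds the 8-bit request header and the data parity bit. It assumes the ADIv5 field order (Start, APnDP, RnW, A[3:2], Parity, Stop, Park, sent LSB first); the function names are hypothetical and nothing here drives real pins:

```c
#include <assert.h>
#include <stdint.h>

/* Parity of the low `bits` bits of v: 1 if an odd number of them are set. */
static uint8_t parity(uint32_t v, unsigned bits)
{
    uint8_t p = 0;
    for (unsigned i = 0; i < bits; i++)
        p ^= (uint8_t)((v >> i) & 1u);
    return p;
}

/* Build the SWD request byte (bit 0 goes on the wire first):
 * bit0 Start=1, bit1 APnDP, bit2 RnW, bit3 A[2], bit4 A[3],
 * bit5 Parity over the previous four bits, bit6 Stop=0, bit7 Park=1.
 * `addr` is the register offset within the DP or AP bank: 0x0, 0x4, 0x8 or 0xC. */
static uint8_t swd_request(int ap, int read, uint8_t addr)
{
    assert((addr & ~0xCu) == 0);
    uint8_t fields = (uint8_t)((ap ? 1u : 0u) | (read ? 2u : 0u) | (addr & 0xCu));
    return (uint8_t)(0x01u | (fields << 1) | (parity(fields, 4) << 5) | 0x80u);
}

/* Parity bit that follows the 32 data bits of a transfer. */
static uint8_t swd_data_parity(uint32_t data)
{
    return parity(data, 32);
}
```

Reading DPIDR (DP, read, offset 0x0) yields the familiar 0xA5 header, and writing AP TAR (AP, write, offset 0x4) yields 0x8B.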
Halting Debug — the Classic Loop
Debug State: CPU halted, pipeline frozen; GPRs and memory accessible via DAP.
Enter via:
Halt request from probe (DHCSR.C_HALT).
Hit on a breakpoint / watchpoint.
A fault or reset caught by vector catch (DEMCR.VC_* bits, e.g. VC_HARDERR) while halting debug is enabled.
Exit via DHCSR.C_HALT=0 (resume) or DHCSR.C_STEP=1 (single-step).
Halting the core does not stop the peripherals by default; the SoC can optionally freeze timers and the watchdog via a vendor "debug freeze" register.
/* DHCSR — 0xE000EDF0
Top half = 0xA05F key required on writes. */
#define DHCSR (*(volatile uint32_t*)0xE000EDF0)
/* Halt */ DHCSR = 0xA05F0003; /* C_DEBUGEN | C_HALT */
/* Step */ DHCSR = 0xA05F000D; /* + C_STEP + MASKINTS */
/* Run */ DHCSR = 0xA05F0001; /* C_DEBUGEN only */
Do not write DHCSR from firmware — it's there for the probe. Firmware writing it will confuse IDE state.
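The magic numbers above are just the write key plus a few control bits. A small sketch that composes them from named fields, as a probe-side tool might; the macro and function names are made up for illustration, while the bit positions follow the Armv7-M DHCSR layout:

```c
#include <assert.h>
#include <stdint.h>

#define DHCSR_DBGKEY     (0xA05Fu << 16)  /* must accompany every write */
#define DHCSR_C_DEBUGEN  (1u << 0)
#define DHCSR_C_HALT     (1u << 1)
#define DHCSR_C_STEP     (1u << 2)
#define DHCSR_C_MASKINTS (1u << 3)
#define DHCSR_S_HALT     (1u << 17)       /* read-only status bit */

/* Value a probe would write to DHCSR for the given control bits. */
static uint32_t dhcsr_cmd(uint32_t control_bits)
{
    assert((control_bits & ~0xFFFFu) == 0);  /* upper half is reserved for the key */
    return DHCSR_DBGKEY | control_bits;
}

/* Interpret a DHCSR read-back: is the core in Debug state? */
static int dhcsr_is_halted(uint32_t dhcsr_readback)
{
    return (dhcsr_readback & DHCSR_S_HALT) != 0;
}
```

dhcsr_cmd(DHCSR_C_DEBUGEN | DHCSR_C_HALT) reproduces 0xA05F0003, and adding C_STEP and C_MASKINTS without C_HALT reproduces 0xA05F000D.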
Flash Patch & Breakpoint Unit (FPB)
Sets hardware breakpoints by comparing PC to a small set of registers.
Breakpoint slots per core:
Cortex-M0 / M1: 4
Cortex-M0+ / M23: 4
Cortex-M3 / M4 / M33: 6 or 8
Cortex-M7 / M85: 8
Software breakpoints via BKPT #imm — unlimited, but require patching code in flash (fine on RAM-resident code).
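FPB "patch" (remap) mode on M3/M4 can redirect matched instruction fetches and literal loads to a table in SRAM (six instruction and two literal comparators); it was used for flash-patch tricks and is not available in the Armv8-M FPB.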
SW vs HW breakpoints in practice
GDB places HW breakpoints in flash (where it can't write) and SW breakpoints in RAM. OpenOCD / pyOCD juggle slots automatically and report an error along the lines of "no HW breakpoints left" when you exceed the limit.
Watch out: setting breakpoints in ISRs quickly uses up slots, and a typical 6-slot M4 cannot hold one breakpoint per RTOS task.
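As an illustration of what "comparing PC to a small set of registers" means in practice, the sketch below encodes one comparator value. It assumes the revision 1 FPB found on Cortex-M3/M4, where a comparator holds a word-aligned Code-region address plus a two-bit field selecting the lower or upper halfword; M7 and Armv8-M parts use a different layout. The names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

#define FP_COMP_ENABLE     (1u << 0)
#define FP_COMP_BKPT_LOWER (1u << 30)  /* break on the halfword at word offset 0 */
#define FP_COMP_BKPT_UPPER (2u << 30)  /* break on the halfword at word offset 2 */

/* Comparator value for a breakpoint on the Thumb instruction at insn_addr
 * (FPB revision 1 layout, assumed). */
static uint32_t fpb_comp_value(uint32_t insn_addr)
{
    assert(insn_addr < 0x20000000u);   /* rev 1 comparators only cover the Code region */
    assert((insn_addr & 1u) == 0);     /* pass the address without the Thumb bit */
    uint32_t half = (insn_addr & 2u) ? FP_COMP_BKPT_UPPER : FP_COMP_BKPT_LOWER;
    return (insn_addr & 0x1FFFFFFCu) | half | FP_COMP_ENABLE;
}
```

The Code-region limit is also why a debugger on these cores falls back to BKPT patching for code running from SRAM.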
Data Watchpoint & Trace (DWT)
Watchpoints
4 comparators (M3/M4), up to 16 (M7+).
Each matches on PC, address, value, or cycle count.
PC sampling — periodic PC snapshot for statistical profiling.
Watchpoint-match events.
Cycle-count timestamp packets.
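The cycle counter behind those timestamps is also handy from firmware. A minimal sketch of enabling and reading it, written against injected register pointers so the logic can be exercised off-target; the struct and function names are hypothetical, and the addresses in the comments are the usual Armv7-M ones:

```c
#include <assert.h>
#include <stdint.h>

/* On target these would point at DEMCR (0xE000EDFC), DWT_CTRL (0xE0001000)
 * and DWT_CYCCNT (0xE0001004). */
typedef struct {
    volatile uint32_t *demcr;
    volatile uint32_t *dwt_ctrl;
    volatile uint32_t *dwt_cyccnt;
} cyccnt_regs_t;

#define DEMCR_TRCENA       (1u << 24)
#define DWT_CTRL_CYCCNTENA (1u << 0)

/* Turn on the trace block, zero the counter, then start it. */
static void cyccnt_start(const cyccnt_regs_t *r)
{
    *r->demcr      |= DEMCR_TRCENA;
    *r->dwt_cyccnt  = 0;
    *r->dwt_ctrl   |= DWT_CTRL_CYCCNTENA;
}

/* Unsigned subtraction stays correct across one 32-bit wrap. */
static uint32_t cyccnt_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Convert a cycle count to microseconds for a given core clock. */
static uint32_t cycles_to_us(uint32_t cycles, uint32_t core_hz)
{
    assert(core_hz != 0);
    return (uint32_t)(((uint64_t)cycles * 1000000u) / core_hz);
}
```

At 168 MHz the 32-bit counter wraps roughly every 25 seconds, so longer intervals need a wrap count or a different time base.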
Arm Streamline + Keil µVision
Combine ITM PC-sampling + exception-trace to produce a per-function CPU-utilisation view, live, with no firmware instrumentation. Great for finding accidental 100%-CPU loops.
Embedded Trace Macrocell (ETM)
Optional block, available on M3/M4/M7/M33/M55/M85.
Produces a compressed stream of branch packets: "I took this branch at this timestamp."
Combined with the static code image, trace tools reconstruct the exact instruction sequence executed.
Non-intrusive — CPU runs at full speed.
Output over:
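SWO pin: ~1–20 MB/s (same path as ITM). Many Cortex-M TPIUs pass only ITM/DWT packets in single-pin SWO mode, so check the part's reference manual before counting on ETM over SWO.
TRACEDATA[0..3] pins: up to 200+ MB/s. Needs a debug probe with parallel trace support (SEGGER J-Trace Pro, Arm ULINKpro).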
When you actually need ETM
Intermittent HardFaults — look at the last N thousand instructions.
Interrupt-order races that a breakpoint would mask.
Performance hotspots at the basic-block level.
Certification traces (ISO 26262, IEC 61508).
TPIU, Trace Clocks & Pin Muxing
TPIU (Trace Port Interface Unit) serialises the merged trace stream for export.
At 0xE00F_F000 every Cortex-M has a ROM table — a list of pointers describing the debug components present.
Probe enumerates by walking the table; fingerprints the SoC.
Each component carries peripheral ID registers (defined in the CoreSight Architecture Specification, ARM IHI 0029); on Cortex-M3/M4, for example, the part numbers are ITM = 0x001, DWT = 0x002, FPB = 0x003.
This is how CMSIS-DAP / pyOCD auto-detect which features the target supports without device-specific code.
On the wire
When you plug in a new board and connect to it (for example with pyocd commander), the tool issues ~20 SWD reads to walk the ROM table and matches component IDs against its internal catalogue.
Writing a new driver? Start by dumping the ROM table. Everything interesting about the SoC's debug surface is described there.
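Walking the table is mostly address arithmetic. The sketch below decodes entries that have already been read over the DAP, assuming the classic 32-bit ROM table format: bit 0 marks the entry present, bits [31:12] hold a signed offset from the table base, and an all-zero word ends the table. The function name is hypothetical and the sample entries in the usage note are typical Cortex-M3/M4 values, shown only as an illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ROM_ENTRY_PRESENT (1u << 0)

/* Decode ROM table entries into component base addresses.
 * Stops at the zero terminator; skips entries marked not present.
 * Returns how many addresses were written to `out`. */
static size_t rom_table_walk(uint32_t table_base,
                             const uint32_t *entries, size_t n_entries,
                             uint32_t *out, size_t out_cap)
{
    assert(entries != NULL && out != NULL);
    size_t found = 0;
    for (size_t i = 0; i < n_entries && entries[i] != 0 && found < out_cap; i++) {
        if (!(entries[i] & ROM_ENTRY_PRESENT))
            continue;
        /* The offset is two's complement, so 32-bit wraparound does the subtraction. */
        out[found++] = table_base + (entries[i] & 0xFFFFF000u);
    }
    return found;
}
```

With the table at 0xE00FF000, the entry 0xFFF0F003 resolves to 0xE000E000 (the System Control Space) and 0xFFF02003 resolves to 0xE0001000 (DWT). A real driver would then read each component's ID registers at the top of its 4 KiB block.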
Semihosting
A pre-CoreSight convention: use BKPT #0xAB as a system call to the debugger.
Provides SYS_OPEN, SYS_WRITE, SYS_READ, SYS_TIME, SYS_EXIT etc.
Great for CI tests — firmware calls fopen("test.log","w") and the file lands on the host.
Terrible for production: blocks CPU for ~200 kcycle per call; firmware hangs if no debugger is attached.
/* Typical use in test firmware */
static inline int semihost(int op, const void *arg)
{
    /* r0 carries the operation number in and the result out; r1 points at the argument */
    register int r0 __asm("r0") = op;
    register const void *r1 __asm("r1") = arg;
    __asm volatile ("bkpt 0xAB" : "+r"(r0) : "r"(r1) : "memory");
    return r0;
}
/* SYS_WRITE0 = 0x04, arg = null-terminated string */
semihost(0x04, "Hello from the chip\n");
Always gate semihosting behind a "debugger attached" check (CoreDebug->DHCSR & CoreDebug_DHCSR_C_DEBUGEN_Msk). Otherwise a field unit hangs forever at the first BKPT.
SEGGER Real-Time Transfer (RTT)
Pure-software alternative to ITM: a ring-buffer in SRAM that the debug probe polls over the DAP.
No extra pins, no special hardware — works on every Cortex-M, even M0.
Bi-directional: several up-channels (target to host) and down-channels (host to target); the channel count is a compile-time setting.
Measured throughput: ~1 MB/s on a J-Link over SWD.
"Fire-and-forget" log calls: ~10 cycles each, safe from any context including fault handlers.
Why RTT beats ITM for most teams
Works with any probe and host tool that can do background DAP memory reads (J-Link, OpenOCD, pyOCD).
No SWO pin or baud-rate configuration.
Identical semantics on M0 (no ITM) and M7 (has ITM).
Back-channel input ("enter a test command") without USB CDC.
Trade-off: RTT is busy-polled by the probe; ITM is hardware-driven into SWO. ITM has a little less jitter for time-sensitive traces.
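The mechanism is easy to picture as a single-producer, single-consumer ring in SRAM. The sketch below is not SEGGER's implementation or its control-block layout; it is a simplified illustration with hypothetical names, where up_read() stands in for what the probe does through DAP memory accesses:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UP_BUF_SIZE 16u  /* deliberately small for illustration */

typedef struct {
    uint8_t           data[UP_BUF_SIZE];
    volatile uint32_t wr;  /* advanced only by the target */
    volatile uint32_t rd;  /* advanced only by the host/probe */
} up_channel_t;

/* Target side: store as much as fits and never block. Returns bytes stored. */
static size_t up_write(up_channel_t *ch, const void *src, size_t len)
{
    assert(ch != NULL && src != NULL);
    const uint8_t *p = src;
    uint32_t wr = ch->wr, rd = ch->rd;
    size_t n = 0;
    while (n < len) {
        uint32_t next = (wr + 1u) % UP_BUF_SIZE;
        if (next == rd)
            break;               /* full: one slot always stays empty */
        ch->data[wr] = p[n++];
        wr = next;
    }
    ch->wr = wr;                 /* publish only after the data is in place
                                    (real firmware may need a barrier here) */
    return n;
}

/* Host-side stand-in for the probe's polling read. Returns bytes copied. */
static size_t up_read(up_channel_t *ch, void *dst, size_t cap)
{
    assert(ch != NULL && dst != NULL);
    uint8_t *p = dst;
    uint32_t rd = ch->rd, wr = ch->wr;
    size_t n = 0;
    while (n < cap && rd != wr) {
        p[n++] = ch->data[rd];
        rd = (rd + 1u) % UP_BUF_SIZE;
    }
    ch->rd = rd;
    return n;
}
```

Because each index has exactly one writer, the target never waits on the host; when the buffer is full this sketch simply drops the excess, which is one of several policies a real implementation might offer.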
Probes in the Wild
| Probe | SWD / JTAG | ETM parallel | Host tools |
|---|---|---|---|
| SEGGER J-Link | ✓ up to 50 MHz | J-Trace Pro (4-bit) | Ozone, GDB, Keil, IAR |
| ST-Link V3 | ✓ | 2-bit trace | STM32CubeIDE, OpenOCD |
| CMSIS-DAP / DAPLink | ✓ | No | pyOCD, OpenOCD |
| Arm ULINKplus / pro | ✓ | ✓ | Keil µVision, Arm DS |
| Raspberry Pi Debug Probe | ✓ (CMSIS-DAP FW) | No | OpenOCD, GDB |
| Lauterbach TRACE32 | ✓ | ✓ (full ETM) | TRACE32 PowerView |
CMSIS-DAP is Arm's open probe protocol and reference firmware, found on many vendor eval boards. "DAPLink" is Arm Mbed's productised version.
A Typical gdb + OpenOCD Session
# terminal 1
$ openocd -f interface/cmsis-dap.cfg -f target/stm32f4x.cfg
Open On-Chip Debugger 0.12.0
Info : CMSIS-DAP: SWD supported
Info : STM32F411CEUx found
Info : target halted due to debug-request, current mode: Thread
Info : Listening on port 3333 for gdb connections
# terminal 2
$ arm-none-eabi-gdb ./firmware.elf
(gdb) target extended-remote :3333
(gdb) load
Loading section .isr_vector, size 0x1c0 lma 0x8000000
Loading section .text, size 0x41d8 lma 0x80001c0
Start address 0x08000198, load size 17448
Transfer rate: 12 KB/sec, 2181 bytes/write.
(gdb) monitor reset halt
(gdb) b main
(gdb) c
Breakpoint 1, main () at src/main.c:42
42 HAL_Init();
(gdb) info registers
r0 0x20000000 536870912
...
# toggle a peripheral via an OpenOCD memory write
(gdb) monitor mww 0x4002103C 0x00000001
(gdb) step
Trace Analysis Tools
Percepio Tracealyzer
Consumes custom RTOS hook output (FreeRTOS, Zephyr, ThreadX) via RTT/ITM; shows task scheduling, priority inversion, queue timing in a visual timeline. De-facto RTOS trace tool.
Arm Streamline
Consumes ITM PC-sampling + DWT exception-trace. Produces flame-graph CPU profile + per-function heat map. Shines with ETM.
SEGGER SystemView
Free companion to RTT-based firmware. Event-visualiser for RTOS calls; integrates with J-Link Ozone.
Bugs that breakpoints hide from you — priority-inversion races, ISR starvation, DMA-to-cache coherency regressions — are exactly the bugs a trace tool shows. Budget ETM/ITM pins on every new board.
Debug in Production
Why lock it down?
An attacker with SWD access can read firmware, extract keys, patch code.
SWD is on by default; pads are routed to test points on almost every board.
How
Readout protection: a vendor-specific option byte or fuse. On STM32 it is RDP, where Level 2 permanently disables the DAP; NXP and Nordic (APPROTECT) have equivalents.
DAP_AUTH challenge-response — newer SoCs (STM32U5, nRF54) require signed token from the probe.
Secure-debug channel — in Armv8-M, DAUTHCTRL gates halting/debug per security state.
Side-channel reality
Setting RDP is not enough — glitching and fault-injection can revive the DAP on many designs. Cortex-M35P was purpose-built with hardware countermeasures; M33 parts advertise similar features as "attack detection".
In practice: ship with RDP1 (unlocked by mass-erase), document the unlock path internally. RDP2 only for "never debuggable again" production SKUs.
TrustZone Impact on Debug
v8-M cores add two per-state gates: Secure debug enable (SDE) and Non-Secure debug enable (NSDE).
If SDE=0, the probe cannot halt the CPU while it is executing Secure code. Memory reads of S regions are RAZ/WI.
This lets secure bootloader authors run RDP-like policies at the state boundary — useful for PSA attestation.
Halt/step from NS code still works when NSDE=1, even if S is locked. Developers of NS app firmware can still debug without unlocking S.
The Authenticated-Debug flow
Probe reads a public challenge from the SoC.
Developer signs challenge with a per-project key held by secure engineering.
Probe replies with the signed token.
On-chip secure HW verifies → raises SDE.
Implemented by Arm PSA Authenticated Debug Access Control (ADAC); shipped on M33 / M85 parts in many production SKUs.
Common Debug Bugs
1. SWD pin muxed as GPIO
A firmware glitch reconfigures SWCLK/SWDIO to GPIO → next debugger connect fails. Use the probe's "connect under reset" mode to recover.
2. Watchdog resets during halt
Halted CPU can't kick the watchdog. Configure the SoC's debug-freeze bit for IWDG/WWDG, or enable the IDE's "halt-suppresses-WD" option.
3. D-cache hides reality
A memory read from the probe goes through the DAP and often straight to RAM, bypassing the CPU cache. A variable that looks stale in the IDE may have its current value sitting in a dirty cache line.
4. Low-power sleep kills the probe link
STOP mode gates the AHB-AP clock → SWD drops. Configure DBGMCU_CR to keep debug running in STOP/STANDBY during bring-up.
5. ITM output stops after reset
Re-enable DEMCR.TRCENA, unlock the ITM through its Lock Access Register (LAR), and rewrite ITM_TCR after every reset. It is easy to forget because the IDE may do it implicitly on attach, but cold-boot firmware must do it itself.
6. Breakpoints in flash-XIP code
Flash-XIP (QSPI) does not support SW breakpoint writes; you only get HW slots. Place critical test code in RAM during bring-up.
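For item 5, the re-enable sequence is short enough to keep in start-up code. A minimal sketch, assuming the Armv7-M register layout and using injected pointers so it can be checked off-target; the names are hypothetical, and the TPIU/SWO setup (prescaler, pin protocol) that real firmware also needs is left out:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* On target: DEMCR 0xE000EDFC, ITM_LAR 0xE0000FB0,
 * ITM_TCR 0xE0000E80, ITM_TER 0xE0000E00. */
typedef struct {
    volatile uint32_t *demcr;
    volatile uint32_t *itm_lar;
    volatile uint32_t *itm_tcr;
    volatile uint32_t *itm_ter;
} itm_regs_t;

#define DEMCR_TRCENA   (1u << 24)
#define ITM_LAR_KEY    0xC5ACCE55u
#define ITM_TCR_ITMENA (1u << 0)

/* Bring ITM back after reset and enable the stimulus ports in port_mask. */
static void itm_reenable(const itm_regs_t *r, uint32_t port_mask)
{
    assert(r != NULL);
    *r->demcr   |= DEMCR_TRCENA;    /* power up the trace blocks first */
    *r->itm_lar  = ITM_LAR_KEY;     /* unlock ITM register writes */
    *r->itm_tcr |= ITM_TCR_ITMENA;  /* global ITM enable */
    *r->itm_ter |= port_mask;       /* e.g. bit 0 for the printf port */
}
```

Calling itm_reenable(&regs, 1u) early in boot makes port 0 usable whether or not an IDE has attached yet.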
Production-Ready Debug Plan
Always wire up SWD (2 pins) to a test header — even on the smallest product.
Leave SWO available (one pin) unless the board is truly pin-starved.
Use RTT from day one (it needs no extra pin); add ITM where an SWO pin is available; ETM only when needed.
Keep a cycle-counted BSOD fault handler that dumps MSP/PSP, CFSR, HFSR, MMFAR, BFAR over RTT and the last MTB/ETM window over a predictable channel.
Set up the watchdog to freeze on debug during dev, run always in production.
Have a documented procedure to unlock RDP for returned units (or accept permanent lockdown).
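A fault dump is far more useful when the CFSR bits are already named. The sketch below decodes a captured CFSR value into flag names, assuming the Armv7-M bit layout; the table covers the common bits only and the function names are hypothetical. The output transport (RTT or otherwise) is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint32_t bit; const char *name; } cfsr_flag_t;

/* Common CFSR bits: MemManage [7:0], BusFault [15:8], UsageFault [31:16]. */
static const cfsr_flag_t cfsr_flags[] = {
    { 1u << 0,  "IACCVIOL"    }, { 1u << 1,  "DACCVIOL"   },
    { 1u << 8,  "IBUSERR"     }, { 1u << 9,  "PRECISERR"  },
    { 1u << 10, "IMPRECISERR" }, { 1u << 16, "UNDEFINSTR" },
    { 1u << 17, "INVSTATE"    }, { 1u << 18, "INVPC"      },
    { 1u << 19, "NOCP"        }, { 1u << 24, "UNALIGNED"  },
    { 1u << 25, "DIVBYZERO"   },
};

#define CFSR_MMARVALID (1u << 7)   /* MMFAR holds a valid address */
#define CFSR_BFARVALID (1u << 15)  /* BFAR holds a valid address */

/* Write the names of the set flags into `out`; returns how many were written. */
static size_t cfsr_decode(uint32_t cfsr, const char **out, size_t cap)
{
    assert(out != NULL);
    size_t n = 0;
    for (size_t i = 0; i < sizeof cfsr_flags / sizeof cfsr_flags[0]; i++)
        if ((cfsr & cfsr_flags[i].bit) && n < cap)
            out[n++] = cfsr_flags[i].name;
    return n;
}

static int mmfar_valid(uint32_t cfsr) { return (cfsr & CFSR_MMARVALID) != 0; }
static int bfar_valid(uint32_t cfsr)  { return (cfsr & CFSR_BFARVALID) != 0; }
```

Checking the VALID bits before printing MMFAR or BFAR avoids reporting a stale address from an earlier fault.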
Minimum CI setup
Raspberry Pi 4 + CMSIS-DAP probe ($25 total).
pyOCD flashes firmware, runs pytest-embedded.
Test firmware uses semihosting for assertion results.
Pass/fail propagates to GitHub Actions via exit code.
Total: < $40 of hardware per rack slot; catches regressions the instant they land.
References
Arm: CoreSight Architecture Specification (ARM IHI 0029)
Arm: Arm Debug Interface (ADIv5, ADIv6) specifications (IHI 0031)
Arm: Cortex-M3 / M4 / M7 Technical Reference Manuals, chapters on DWT, ITM, FPB, ETM
Arm: Armv8-M Debug Architecture extensions, Authenticated Debug (ADAC)
Joseph Yiu: Definitive Guide to Cortex-M3/M4, Chapter 14, debug and trace architecture
SEGGER: Application Note, Real-Time Transfer (RTT), and J-Link User Guide
pyOCD: Python CoreSight driver library, github.com/pyocd/pyOCD
OpenOCD: open-source debug, github.com/openocd-org/openocd, best reference source for CoreSight protocols