ARM CORTEX-M · PRESENTATION 06

Debug & Trace — CoreSight

SWD / JTAG · DAP · FPB · DWT · ITM · ETM · TPIU · MTB · RTT · Semihosting
How a 2-wire link reveals the entire CPU state
02

Why CoreSight Matters

  • Every Cortex-M has the same debug & trace IP — CoreSight — so the same probe drives an STM32, Nordic nRF52, and NXP i.MX RT.
  • The debug system offers both intrusive features (halt, single-step, read memory) and non-intrusive ones (trace, watchpoints with hit counts, printf via ITM).
  • On M0/M0+/M23 the footprint is minimal (up to 4 hardware breakpoints, 2 to 4 watchpoints, usually no ETM); on M7/M55/M85 it is a full-strength trace infrastructure.
  • A debugger finds 80% of bugs; a tracer finds the remaining 20% — timing, race conditions, ISR interactions.

Interview angle

A strong candidate can sketch the CoreSight block diagram on the whiteboard, explain what the DAP is, and describe how printf() makes it off-chip via ITM+SWO without stopping the CPU.

03

CoreSight Block Diagram

  • Debug Probe (JLink · ST-Link · CMSIS-DAP): USB ↔ SWD/JTAG
  • DAP (SWJ-DP / SW-DP): bus master inside the SoC
    • AHB-AP → system bus
    • APB-AP → CoreSight
  • Cortex-M core
    • FPB (breakpoints)
    • DWT (watchpoints)
    • ITM (stim. trace)
    • ETM (instr. trace)
    • MTB (M0+ only)
    • NVIC / SCB / MPU (SCS, the System Control Space)
  • Trace Funnel: merges ITM + ETM
  • TPIU: trace port / SWO
  • Package pins
    • SWDIO & SWCLK
    • SWO (trace)
    • TRACEDATA[n] (ETM)
    • nRESET (or JTAG: TCK/TMS/TDI/TDO)
04

SWD vs JTAG

                    | JTAG (4/5 wires)                 | SWD (2 wires)
Pins                | TCK, TMS, TDI, TDO (+nTRST)      | SWCLK, SWDIO
Bandwidth           | ~25 Mbps typ.                    | ~50 Mbps typ. (half-duplex)
Daisy-chaining      | IEEE 1149.1 scan chain           | Multi-drop SWD (ARM IHI 0031)
Boundary scan       | Yes                              | No
Typical MCU pinout  | Optional; shares pins via SWJ-DP | Default on nearly every Cortex-M
  • SWJ-DP is a dual-role block that speaks either JTAG or SWD; the probe issues a 16-bit select sequence (0x79E7 sent MSB-first, equivalently 0xE79E sent LSB-first) on the TMS/SWDIO pin to switch from JTAG to SWD.
  • SWO (Serial Wire Output) is a one-way pin emitting ITM/TPIU packets at up to 200 Mbps, using either UART (NRZ) or Manchester encoding.
Why SWD won: MCUs have few pins. Two-wire SWD gives equivalent debug without the TDI/TDO cost. Every vendor defaults to SWD today, with JTAG optional.
05

The Debug Access Port (DAP)

  • Small bus master inside the SoC; accepts probe commands and performs memory/register reads/writes without halting the CPU.
  • Two register banks:
    • DP (Debug Port). Selects an AP, handles debug power-up requests, reports errors (CTRL/STAT).
    • AP (Access Port). On a Cortex-M the AHB-AP reaches both the system bus and the PPB; larger SoCs add further APs, such as an APB-AP for CoreSight components.
  • Register model is memory-mapped over the probe transport: on SWD each transfer is an 8-bit request header, a 3-bit ACK, and 32 data bits plus 1 parity bit, with turnaround cycles where the line changes direction.
/* Pseudo — probe interaction */
DP:SELECT   = AP 0, bank 0      ; pick AHB-AP
AP:CSW      = 0x23000012         ; Size = 32-bit (bits 2:0 = 010), AddrInc = single (bits 5:4 = 01)
AP:TAR      = 0x20000000         ; target addr
AP:DRW      = read/write         ; data read or write
AP:TAR      auto-incs 4 bytes    ; burst reads follow

/* High-level: dump 1 KiB of SRAM without halting */

CMSIS-DAP and pyOCD expose this at a higher level.
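
To make the SWD packet layout concrete, here is a minimal host-side sketch of how a probe could build the 8-bit request header and the parity bit that follows the 32 data bits. The function names swd_request and swd_data_parity are illustrative only and do not belong to any probe firmware API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Build the 8-bit SWD request header (bit 0 is sent first on the wire):
   bit0 Start = 1, bit1 APnDP, bit2 RnW, bit3 A[2], bit4 A[3],
   bit5 Parity (even, over APnDP/RnW/A[2]/A[3]), bit6 Stop = 0, bit7 Park = 1.
   addr is the register byte offset: 0x0, 0x4, 0x8 or 0xC. */
uint8_t swd_request(bool ap, bool read, uint8_t addr)
{
    uint8_t a2 = (addr >> 2) & 1u;
    uint8_t a3 = (addr >> 3) & 1u;
    uint8_t parity = (uint8_t)((ap ^ read ^ a2 ^ a3) & 1u);

    return (uint8_t)(0x01u | (ap << 1) | (read << 2) | (a2 << 3) |
                     (a3 << 4) | (parity << 5) | 0x80u);
}

/* Even-parity bit that follows the 32 data bits of a transfer. */
uint8_t swd_data_parity(uint32_t data)
{
    data ^= data >> 16;
    data ^= data >> 8;
    data ^= data >> 4;
    data ^= data >> 2;
    data ^= data >> 1;
    return (uint8_t)(data & 1u);
}
```

For example, swd_request(false, true, 0x0) gives 0xA5, the familiar "read DP register 0" header, and swd_request(true, true, 0xC) gives 0x9F, an AP read of DRW.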

06

Halting Debug — the Classic Loop

  • Debug State: CPU halted, pipeline frozen; GPRs and memory accessible via DAP.
  • Enter via:
    • Halt request from probe (DHCSR.C_HALT).
    • Hit on a breakpoint / watchpoint.
    • Vector catch: a fault or reset exception selected in DEMCR (VC_HARDERR, VC_CORERESET, ...) while halting debug is enabled.
  • Exit via DHCSR.C_HALT=0 (resume) or DHCSR.C_STEP=1 (single-step).
  • Halting the core does not stop the peripherals: timers and the watchdog keep running unless the SoC's "debug freeze" register stops them.
/* DHCSR — 0xE000EDF0
   Top half = 0xA05F key required on writes. */
#define DHCSR (*(volatile uint32_t*)0xE000EDF0)

/* Halt */ DHCSR = 0xA05F0003;     /* C_DEBUGEN | C_HALT */
/* Step */ DHCSR = 0xA05F000D;     /* + C_STEP + MASKINTS */
/* Run  */ DHCSR = 0xA05F0001;     /* C_DEBUGEN only */
Do not write DHCSR from firmware — it's there for the probe. Firmware writing it will confuse IDE state.
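
Seen from the probe side, the three magic numbers above are just the debug key OR-ed with control bits. The sketch below composes them and checks the S_HALT status bit in a value read back over the DAP. The helper names dhcsr_word and dhcsr_is_halted are hypothetical; a real probe would send the resulting word through its AHB-AP write routine.

```c
#include <stdint.h>

#define DHCSR_DBGKEY      0xA05F0000u  /* must sit in bits [31:16] of every write */
#define DHCSR_C_DEBUGEN   (1u << 0)
#define DHCSR_C_HALT      (1u << 1)
#define DHCSR_C_STEP      (1u << 2)
#define DHCSR_C_MASKINTS  (1u << 3)
#define DHCSR_S_HALT      (1u << 17)   /* read-only status: core is in Debug state */

/* Compose a DHCSR write value from control bits (hypothetical helper). */
uint32_t dhcsr_word(uint32_t ctrl_bits)
{
    return DHCSR_DBGKEY | (ctrl_bits & 0xFFFFu);
}

/* Interpret a DHCSR value read back by the probe. */
int dhcsr_is_halted(uint32_t dhcsr_read)
{
    return (dhcsr_read & DHCSR_S_HALT) != 0;
}
```

A probe would typically write the halt word, then poll DHCSR until dhcsr_is_halted() reports true before touching core registers.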
07

Flash Patch & Breakpoint Unit (FPB)

  • Sets hardware breakpoints by comparing the instruction fetch address against a small set of comparator registers.
  • Breakpoint slots per core:
    • Cortex-M0 / M1: up to 4
    • Cortex-M0+ / M23: up to 4
    • Cortex-M3 / M4 / M33: 6 or 8
    • Cortex-M7 / M85: up to 8
  • Software breakpoints via BKPT #imm — unlimited, but require patching code in flash (fine on RAM-resident code).
  • FPB "patch" (remap) mode on M3/M4 can redirect matched instruction or literal fetches from flash to SRAM. It was used for flash-patch tricks and is not available in the FPB of the M7 and v8-M cores.

SW vs HW breakpoints in practice

GDB places HW breakpoints in flash (where it can't write) and SW breakpoints in RAM. OpenOCD / pyOCD juggle slots automatically, throwing "no HW breakpoints left" when you exceed the limit.

Watch out: breakpoints set in ISRs and RTOS tasks use up slots quickly. A typical 6-slot M4 cannot hold one breakpoint per RTOS task.
08

Data Watchpoint & Trace (DWT)

Watchpoints

  • Typically 4 comparators (M3/M4/M7); the architecture allows more, and DWT_CTRL.NUMCOMP reports how many a given part implements.
  • Each matches on PC, address, value, or cycle count.
  • Programmable mask for range matches.
  • Triggers: halt, ITM event, ETM start/stop, counter.

Counters

  • CYCCNT — 32-bit free-running cycle count (use it for timing measurements).
  • CPICNT, EXCCNT, SLEEPCNT, LSUCNT, FOLDCNT: 8-bit counters for extra CPI cycles, exception overhead, sleep, load-store, and folded instructions. Each wraps at 256 and can emit an overflow event packet over ITM.
/* Use DWT CYCCNT for profiling */
#include <stdint.h>

#define DWT_CTRL   (*(volatile uint32_t*)0xE0001000)
#define DWT_CYCCNT (*(volatile uint32_t*)0xE0001004)
#define DEMCR      (*(volatile uint32_t*)0xE000EDFC)

void dwt_init(void) {
    DEMCR    |= (1u << 24);   /* TRCENA: enables the DWT and ITM blocks */
    DWT_CYCCNT = 0;
    DWT_CTRL |= 1;            /* CYCCNTENA */
}

/* Unsigned subtraction stays correct across a single CYCCNT wrap. */
uint32_t cycles_elapsed(uint32_t start) {
    return DWT_CYCCNT - start;
}

Cycle-accurate profiling with a single 32-bit register. Every embedded engineer's best friend.
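
Because CYCCNT is only 32 bits wide, it wraps after 2^32 cycles (about 25.6 s at 168 MHz). The sketch below shows the wrap-safe difference and a conversion to microseconds. The helper names cyc_delta and cyc_to_us are illustrative, and core_hz is assumed to be the real, non-zero core clock.

```c
#include <stdint.h>

/* Cycles between two CYCCNT snapshots. Unsigned arithmetic makes this
   correct even if the counter wrapped once between the two reads. */
uint32_t cyc_delta(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Convert a cycle count to microseconds. The 64-bit intermediate avoids
   overflow; core_hz must be non-zero. */
uint32_t cyc_to_us(uint32_t cycles, uint32_t core_hz)
{
    return (uint32_t)(((uint64_t)cycles * 1000000u) / core_hz);
}
```

Intervals longer than one wrap period cannot be measured this way; they need a software extension of the counter or a coarser timer.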

09

Instrumentation Trace Macrocell (ITM)

  • 32 stimulus ports, each a 32-bit register at 0xE000_0000 .. 0xE000_007C.
  • Writes form packets which are streamed out via SWO / TPIU as fast as the trace port drains the ITM FIFO.
  • Near-zero-cost "printf": one store per character (or per four characters with word writes); the only wait is a short poll when the FIFO is full.
  • Protocol also carries DWT events, timestamp packets, and PC-sampling packets at configurable rate.
  • Host decoder (OpenOCD / JLink / Keil µVision) demultiplexes port 0 → console, port 1 → data log, etc.
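/* Classic retargeted putc() */
#include <stddef.h>
#include <stdint.h>

#define ITM_STIM0 (*(volatile uint32_t*)0xE0000000)
#define ITM_TER   (*(volatile uint32_t*)0xE0000E00)
#define ITM_TCR   (*(volatile uint32_t*)0xE0000E80)

int _write(int fd, const void *buf, size_t len) {
    const char *s = buf;
    (void)fd;                               /* every stream goes to port 0 */
    /* Drop output if ITM or port 0 is disabled; the poll below might never end. */
    if (!(ITM_TCR & 1) || !(ITM_TER & 1))
        return (int)len;
    for (size_t i = 0; i < len; i++) {
        while (!(ITM_STIM0 & 1));           /* wait until the FIFO can accept data */
        *(volatile uint8_t *)&ITM_STIM0 = (uint8_t)s[i];
    }
    return (int)len;                        /* newlib expects the number of bytes written */
}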

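Each byte costs ~3 cycles when the FIFO has room. By contrast, a polled UART at 115.2 kbaud takes ~87 µs per byte (10 bits on the wire), which is ~700 cycles at 8 MHz and proportionally more at higher core clocks.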

10

ITM Ports in Practice

Port  | Typical use
0     | Console output (printf / puts)
1–3   | Binary log data (machine-readable)
31    | Error / assertion channel
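
On the host, routing by port comes down to reading the header byte of each ITM source packet. The following is a minimal sketch of that step, assuming the standard ITM encoding (bits [1:0] = payload size, bit 2 = hardware/software source, bits [7:3] = port). The type and function names are illustrative, not taken from any decoder library.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint8_t port;         /* stimulus port 0..31 */
    uint8_t payload_len;  /* 1, 2 or 4 bytes follow the header */
} itm_sw_hdr_t;

/* Returns true and fills *out when hdr starts a software (stimulus port)
   packet. Returns false for protocol packets (size field 0: sync, overflow,
   timestamps) and for hardware source packets generated by the DWT. */
bool itm_decode_sw_header(uint8_t hdr, itm_sw_hdr_t *out)
{
    uint8_t ss = hdr & 0x3u;

    if (ss == 0u || (hdr & 0x4u) != 0u)
        return false;

    out->port = (uint8_t)(hdr >> 3);
    out->payload_len = (ss == 3u) ? 4u : ss;
    return true;
}
```

A decoder would call this on each header byte, then append the next payload_len bytes to the console for port 0, to a log file for ports 1–3, and so on.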

DWT events over ITM

  • Exception entry/exit packets (DWT_CTRL.EXCTRCENA).
  • PC sampling — periodic PC snapshot for statistical profiling.
  • Watchpoint-match events.
  • Cycle-count timestamp packets.

Arm Streamline + Keil µVision

Combine ITM PC-sampling + exception-trace to produce a per-function CPU-utilisation view, live, with no firmware instrumentation. Great for finding accidental 100%-CPU loops.

11

Embedded Trace Macrocell (ETM)

  • Optional block, available on M3/M4/M7/M33/M55/M85.
  • Produces a compressed stream of branch packets: "I took this branch at this timestamp."
  • Combined with the static code image, trace tools reconstruct the exact instruction sequence executed.
  • Non-intrusive — CPU runs at full speed.
  • Output over:
    • SWO pin — ~1–20 MB/s (same as ITM path).
    • TRACEDATA[0..3] pins: up to 200+ MB/s. Needs a debug probe with parallel trace support (SEGGER J-Trace Pro, Arm ULINKpro).

When you actually need ETM

  • Intermittent HardFaults — look at the last N thousand instructions.
  • Interrupt-order races that a breakpoint would mask.
  • Performance hotspots at the basic-block level.
  • Certification traces (ISO 26262, IEC 61508).
12

TPIU, Trace Clocks & Pin Muxing

  • TPIU (Trace Port Interface Unit) serialises the merged trace stream for export.
  • Two output modes:
    • SWO (serial) — 1 pin at 1–200 Mbps; low-pin-count vendors prefer this.
    • Parallel — 1/2/4-bit TRACEDATA + TRACECLK pins. Higher bandwidth, dedicated pins.
  • Pin muxing is SoC-specific: on STM32 MCUs the TRACEDATA/TRACECLK signals are alternate functions of ordinary GPIO pins, so a board design has to commit those pins to one use.
  • SWO clock is usually derived from the CPU clock — prescaled by TPIU->ACPR.

Practical setup

  1. Enable TRCENA in DEMCR.
  2. Configure ITM: set ITM_LAR=0xC5ACCE55, ITM_TCR=0x00010005 (ITMENA, SYNCENA, TraceBusID=1), then enable the stimulus ports you use in ITM_TER (bit 0 for port 0).
  3. Configure TPIU: TPIU_SPPR=2 (NRZ SWO), TPIU_ACPR = core_hz/swo_hz − 1.
  4. Configure vendor's debug-mux to route SWO to a pin.
  5. Host: start SWO listener matching that baud.
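
Steps 1 to 3 can be sketched in C as follows. This is an outline under assumptions rather than a drop-in driver: it uses the architectural register addresses of Cortex-M3/M4-class parts, assumes the TPIU prescaler is fed from the core clock, and leaves step 4 out because it is vendor-specific. The names swo_acpr and swo_init are illustrative.

```c
#include <stdint.h>

#define REG32(addr)  (*(volatile uint32_t *)(uintptr_t)(addr))
#define DEMCR        REG32(0xE000EDFCu)
#define ITM_TER      REG32(0xE0000E00u)
#define ITM_TCR      REG32(0xE0000E80u)
#define ITM_LAR      REG32(0xE0000FB0u)
#define TPIU_ACPR    REG32(0xE0040010u)
#define TPIU_SPPR    REG32(0xE00400F0u)

/* Prescaler for a requested SWO rate: swo_hz = trace_clk_hz / (ACPR + 1).
   Rounds to the nearest divider; assumes 0 < swo_hz <= trace_clk_hz. */
uint32_t swo_acpr(uint32_t trace_clk_hz, uint32_t swo_hz)
{
    return (trace_clk_hz + swo_hz / 2u) / swo_hz - 1u;
}

/* Target-only: writes the debug registers. Do not call on a host build. */
void swo_init(uint32_t trace_clk_hz, uint32_t swo_hz)
{
    DEMCR    |= (1u << 24);                 /* step 1: TRCENA */
    TPIU_SPPR = 2u;                         /* step 3: NRZ (UART-style) SWO */
    TPIU_ACPR = swo_acpr(trace_clk_hz, swo_hz);
    ITM_LAR   = 0xC5ACCE55u;                /* step 2: unlock ITM */
    ITM_TCR   = 0x00010005u;                /* ITMENA | SYNCENA | TraceBusID = 1 */
    ITM_TER   = 1u;                         /* enable stimulus port 0 */
}
```

With a 168 MHz core clock and a 2 MHz SWO rate, swo_acpr returns 83. The host listener must be set to the rate that actually results from the rounded divider.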
13

Micro Trace Buffer (MTB) on Cortex-M0+

  • M0+ lacks ETM, but offers a small Micro Trace Buffer — last N branches stored in on-chip SRAM.
  • Typical size: 512 B – 4 KiB of SRAM reserved at boot.
  • Each "trace record" is 8 bytes: source PC + destination PC of one taken branch.
  • After a HardFault, the MTB content is the cheapest possible "last N branches" history (128 branches for a 1 KiB buffer at 8 bytes per record): extremely useful forensics.
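/* Reserve 1 KiB of SRAM for MTB.
   The MTB register block address is SoC-specific; 0xF0002000 is a
   placeholder, so check the device reference manual. Register offsets
   follow the CoreSight MTB-M0+ layout. */
#include <stdint.h>

#define MTB_POSITION  (*(volatile uint32_t*)0xF0002000)  /* write pointer */
#define MTB_MASTER    (*(volatile uint32_t*)0xF0002004)  /* EN + MASK */
#define MTB_FLOW      (*(volatile uint32_t*)0xF0002008)  /* watermark control */
#define MTB_BASE      (*(volatile uint32_t*)0xF000200C)  /* read-only: trace SRAM base */

static uint8_t __attribute__((aligned(1024))) mtb_buf[1024];

void mtb_init(void) {
    MTB_POSITION = ((uint32_t)mtb_buf - MTB_BASE) & ~7u;  /* offset from the SRAM base */
    MTB_FLOW     = 0;                                     /* no watermark action */
    MTB_MASTER   = (1u << 31) |        /* EN */
                   6u;                 /* MASK: buffer = 2^(MASK+4) bytes = 1 KiB */
}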
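
Two small helpers make the buffer arithmetic and the record format explicit. They assume the MTB-M0+ encoding as I understand it: buffer size is 2^(MASK+4) bytes, bit 0 of the first word (the A-bit) flags a branch caused by an exception or debug PC update, and bit 0 of the second word (the S-bit) marks the first record after tracing started. Names are illustrative; verify the bit meanings against the MTB documentation for your part.

```c
#include <stdbool.h>
#include <stdint.h>

/* Smallest MASK value whose buffer, 2^(MASK+4) bytes, holds `bytes`. */
uint32_t mtb_mask_for_size(uint32_t bytes)
{
    uint32_t mask = 0;
    while (mask < 27u && (16u << mask) < bytes)
        mask++;
    return mask;
}

typedef struct {
    uint32_t src;         /* address the branch was taken from */
    uint32_t dst;         /* address execution continued at */
    bool     exception;   /* A-bit: exception entry/return or debug PC update */
    bool     trace_start; /* S-bit: first record after trace was (re)started */
} mtb_branch_t;

/* Decode one 8-byte MTB record given as its two 32-bit words. */
mtb_branch_t mtb_decode(uint32_t word0, uint32_t word1)
{
    mtb_branch_t b;
    b.src         = word0 & ~1u;
    b.exception   = (word0 & 1u) != 0u;
    b.dst         = word1 & ~1u;
    b.trace_start = (word1 & 1u) != 0u;
    return b;
}
```

A fault handler or the debugger can walk the buffer backwards from the current write position and feed each pair of words to mtb_decode to list the most recent branches.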
14

CoreSight ROM Tables

  • At 0xE00F_F000 every Cortex-M has a ROM table — a list of pointers describing the debug components present.
  • Probe enumerates by walking the table; fingerprints the SoC.
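  • Each component identifies itself through its Peripheral ID registers (see the CoreSight Architecture Specification, ARM IHI 0029); on Cortex-M3/M4, for example, the part numbers are ITM = 0x001, DWT = 0x002, FPB = 0x003.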
  • This is how CMSIS-DAP / pyOCD auto-detect which features the target supports without device-specific code.
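
The table walk itself is simple. In the sketch below, each 32-bit entry carries a present flag in bit 0 and a signed offset from the table base in bits [31:12], and a zero entry ends the table. The entries array stands in for words the probe has already read over the DAP; the function names are illustrative.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bit 0 of an entry says whether the component is actually fitted. */
bool rom_entry_present(uint32_t entry)
{
    return (entry & 1u) != 0u;
}

/* Component base = table base + signed offset held in bits [31:12].
   32-bit wraparound takes care of negative offsets. */
uint32_t rom_entry_address(uint32_t table_base, uint32_t entry)
{
    return table_base + (entry & 0xFFFFF000u);
}

/* Walk up to n entries, stopping at the zero terminator. Stores the base
   address of every present component in out[] and returns how many. */
size_t rom_walk(uint32_t table_base, const uint32_t *entries, size_t n,
                uint32_t *out, size_t out_max)
{
    size_t found = 0;
    for (size_t i = 0; i < n && entries[i] != 0u; i++) {
        if (rom_entry_present(entries[i]) && found < out_max)
            out[found++] = rom_entry_address(table_base, entries[i]);
    }
    return found;
}
```

With a table base of 0xE00FF000, an entry of 0xFFF0F003 resolves to 0xE000E000 (the SCS) and 0xFFF02003 to 0xE0001000 (the DWT). The probe then reads each component's ID registers at the top of its 4 KiB block.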

On the wire

When you plug in a new board and connect with pyOCD (for example with pyocd commander), the tool issues ~20 SWD reads to walk the ROM table and matches component IDs against its internal catalogue.

Writing a new driver? Start by dumping the ROM table. Everything interesting about the SoC's debug surface is described there.
15

Semihosting

  • A pre-CoreSight convention: use BKPT #0xAB as a system call to the debugger.
  • Debugger catches the breakpoint, inspects registers: R0 = syscall number, R1 = arg ptr.
  • Provides SYS_OPEN, SYS_WRITE, SYS_READ, SYS_TIME, SYS_EXIT etc.
  • Great for CI tests — firmware calls fopen("test.log","w") and the file lands on the host.
  • Terrible for production: blocks CPU for ~200 kcycle per call; firmware hangs if no debugger is attached.
/* Typical use in test firmware */
static inline int semihost(int op, void *arg)
{
    register int r0 __asm("r0") = op;
    register void *r1 __asm("r1") = arg;
    __asm volatile (
        "bkpt 0xAB" : "+r"(r0) : "r"(r1) : "memory");
    return r0;
}

/* SYS_WRITE0 = 0x04, arg = null-term string */
semihost(0x04, "Hello from the chip\n");
Always gate semihosting behind a "debugger attached" check (CoreDebug->DHCSR & C_DEBUGEN). Otherwise a field unit hangs forever at the first BKPT.
16

SEGGER Real-Time Transfer (RTT)

  • Pure-software alternative to ITM: a ring-buffer in SRAM that the debug probe polls over the DAP.
  • No extra pins, no special hardware — works on every Cortex-M, even M0.
  • Bi-directional: 8 up-channels and 8 down-channels.
  • Measured throughput: ~1 MB/s on a JLink over SWD.
  • "Fire-and-forget" log calls: ~10 cycles each, safe from any context including fault handlers.

Why RTT beats ITM for most teams

  • Works with any probe that can do DAP memory reads (JLink, OpenOCD, pyOCD).
  • No SWO pin or baud-rate configuration.
  • Identical semantics on M0 (no ITM) and M7 (has ITM).
  • Back-channel input ("enter a test command") without USB CDC.
Trade-off: RTT is busy-polled by the probe; ITM is hardware-driven into SWO. ITM has a little less jitter for time-sensitive traces.
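
The core idea, a single-producer single-consumer ring buffer that the target fills and the probe drains, can be sketched in a few lines. This is not SEGGER's implementation (real RTT adds a control block with an ID string that the probe searches for in RAM, plus several channels and modes); the names up_channel_t, up_write and up_read are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define UP_BUF_SIZE 64u

typedef struct {
    uint8_t           buf[UP_BUF_SIZE];
    volatile uint32_t wr;   /* advanced by the target only */
    volatile uint32_t rd;   /* advanced by the host (probe) only */
} up_channel_t;

/* Target side: copy what fits and never block. Returns bytes queued;
   anything that does not fit is dropped ("skip" behaviour). */
size_t up_write(up_channel_t *ch, const void *data, size_t len)
{
    const uint8_t *p = data;
    size_t n = 0;
    uint32_t wr = ch->wr;

    while (n < len) {
        uint32_t next = (wr + 1u) % UP_BUF_SIZE;
        if (next == ch->rd)
            break;                 /* buffer full */
        ch->buf[wr] = p[n++];
        wr = next;
    }
    ch->wr = wr;                   /* publish only after the data is in place */
    return n;
}

/* Host side: what the probe does with plain DAP memory reads and one write. */
size_t up_read(up_channel_t *ch, void *dst, size_t max)
{
    uint8_t *p = dst;
    size_t n = 0;
    uint32_t rd = ch->rd;

    while (n < max && rd != ch->wr) {
        p[n++] = ch->buf[rd];
        rd = (rd + 1u) % UP_BUF_SIZE;
    }
    ch->rd = rd;
    return n;
}
```

One slot is kept empty to tell "full" from "empty", so a 64-byte buffer holds 63 bytes. On cores with caches or write buffers a memory barrier before publishing wr may be needed; that detail is omitted here.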
17

Probes in the Wild

Probe                     | SWD / JTAG        | ETM parallel        | Host tools
SEGGER J-Link             | ✓ up to 50 MHz    | J-Trace Pro (4-bit) | Ozone, GDB, Keil, IAR
ST-Link V3                |                   | 2-bit trace         | STM32CubeIDE, OpenOCD
CMSIS-DAP / DAPLink       |                   |                     | pyOCD, OpenOCD
Arm ULINKplus / pro       |                   |                     | Keil µVision, Arm DS
Raspberry Pi Debug Probe  | ✓ (CMSIS-DAP FW)  |                     | OpenOCD, GDB
Lauterbach TRACE32        |                   | ✓ (full ETM)        | TRACE32 PowerView

CMSIS-DAP is Arm's open reference probe firmware and ships on many vendor eval boards. "DAPLink" is Arm Mbed's productised version.

18

A Typical gdb + OpenOCD Session

# terminal 1
$ openocd -f interface/cmsis-dap.cfg -f target/stm32f4x.cfg
Open On-Chip Debugger 0.12.0
Info : CMSIS-DAP: SWD supported
Info : STM32F411CEUx found
Info : target halted due to debug-request, current mode: Thread
Info : Listening on port 3333 for gdb connections

# terminal 2
$ arm-none-eabi-gdb ./firmware.elf
(gdb) target extended-remote :3333
(gdb) load
Loading section .isr_vector, size 0x1c0 lma 0x8000000
Loading section .text,        size 0x41d8 lma 0x80001c0
Start address 0x08000198, load size 17448
Transfer rate: 12 KB/sec, 2181 bytes/write.
(gdb) monitor reset halt
(gdb) b main
(gdb) c
Breakpoint 1, main () at src/main.c:42
42    HAL_Init();
(gdb) info registers
r0 0x20000000   536870912
...
(gdb) # toggle a peripheral
(gdb) monitor mww 0x4002103C 0x00000001
(gdb) step
19

Trace Analysis Tools

Percepio Tracealyzer

Consumes custom RTOS hook output (FreeRTOS, Zephyr, ThreadX) via RTT/ITM; shows task scheduling, priority inversion, queue timing in a visual timeline. De-facto RTOS trace tool.

Arm Streamline

Consumes ITM PC-sampling + DWT exception-trace. Produces flame-graph CPU profile + per-function heat map. Shines with ETM.

SEGGER SystemView

Free companion to RTT-based firmware. Event-visualiser for RTOS calls; integrates with J-Link Ozone.

Bugs that breakpoints hide from you — priority-inversion races, ISR starvation, DMA-to-cache coherency regressions — are exactly the bugs a trace tool shows. Budget ETM/ITM pins on every new board.
20

Debug in Production

Why lock it down?

  • An attacker with SWD access can read firmware, extract keys, patch code.
  • SWD is on by default; pads are routed to test points on almost every board.

How

  • Readout protection: a vendor-specific option setting. On STM32 it is called RDP, and Level 2 permanently disables the DAP; Nordic (APPROTECT) and NXP parts have comparable mechanisms.
  • DAP_AUTH challenge-response — newer SoCs (STM32U5, nRF54) require signed token from the probe.
  • Secure-debug channel — in Armv8-M, DAUTHCTRL gates halting/debug per security state.

Side-channel reality

Setting RDP is not enough — glitching and fault-injection can revive the DAP on many designs. Cortex-M35P was purpose-built with hardware countermeasures; M33 parts advertise similar features as "attack detection".

In practice: ship with RDP1 (unlocked by mass-erase), document the unlock path internally. RDP2 only for "never debuggable again" production SKUs.
21

TrustZone Impact on Debug

  • v8-M cores add two per-state gates: Secure debug enable (SDE) and Non-Secure debug enable (NSDE).
  • If SDE=0, the probe cannot halt the CPU while it is executing Secure code. Memory reads of S regions are RAZ/WI.
  • This lets secure bootloader authors run RDP-like policies at the state boundary — useful for PSA attestation.
  • Halt/step from NS code still works when NSDE=1, even if S is locked. Developers of NS app firmware can still debug without unlocking S.

The Authenticated-Debug flow

  1. Probe reads a public challenge from the SoC.
  2. Developer signs challenge with a per-project key held by secure engineering.
  3. Probe replies with the signed token.
  4. On-chip secure HW verifies → raises SDE.

Implemented by Arm's PSA Authenticated Debug Access Control (ADAC) specification; ships on many M33 / M85 production SKUs.

22

Common Debug Bugs

1. SWD pin muxed as GPIO

A firmware glitch reconfigures SWCLK/SWDIO to GPIO → next debugger connect fails. Use the probe's "connect under reset" mode to recover.

2. Watchdog resets during halt

Halted CPU can't kick the watchdog. Configure the SoC's debug-freeze bit for IWDG/WWDG, or enable the IDE's "halt-suppresses-WD" option.

3. D-cache hides reality

A memory read from the probe goes through the DAP, often straight to RAM, bypassing the CPU cache. A variable that looks "stale" in the IDE may have its current value sitting in a dirty cache line that has not been written back yet.

4. Low-power sleep kills the probe link

STOP mode gates the AHB-AP clock → SWD drops. Configure DBGMCU_CR to keep debug running in STOP/STANDBY during bring-up.

5. ITM output stops after reset

Re-enable TRCENA + LAR + TCR after every reset. It is easy to forget because the IDE may do it implicitly on attach, but cold-boot firmware must do it itself.

6. Breakpoints in flash-XIP code

Flash-XIP (QSPI) does not support SW breakpoint writes; you only get HW slots. Place critical test code in RAM during bring-up.

23

Production-Ready Debug Plan

  • Always wire up SWD (2 pins) to a test header — even on the smallest product.
  • Leave SWO available (one pin) unless the board is truly pin-starved.
  • Use RTT from day one (it needs no extra pin); add ITM when the SWO pin is available; ETM only when needed.
  • Keep a "blue-screen" fault handler that dumps MSP/PSP, CFSR, HFSR, MMFAR, BFAR with a CYCCNT timestamp over RTT, and the last MTB/ETM window over a predictable channel.
  • Set up the watchdog to freeze on debug during dev, run always in production.
  • Have a documented procedure to unlock RDP for returned units (or accept permanent lockdown).
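
For the fault-handler dump, raw register values are much easier to read once CFSR is decoded. The sketch below maps the most common CFSR bits to their architectural names and checks whether BFAR and MMFAR hold valid addresses. It covers a subset of the flags only, and the helper names are illustrative.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t    mask;
    const char *name;
} cfsr_flag_t;

/* Subset of CFSR: MemManage [7:0], BusFault [15:8], UsageFault [31:16]. */
static const cfsr_flag_t cfsr_flags[] = {
    { 1u << 0,  "IACCVIOL"    },
    { 1u << 1,  "DACCVIOL"    },
    { 1u << 8,  "IBUSERR"     },
    { 1u << 9,  "PRECISERR"   },
    { 1u << 10, "IMPRECISERR" },
    { 1u << 16, "UNDEFINSTR"  },
    { 1u << 17, "INVSTATE"    },
    { 1u << 18, "INVPC"       },
    { 1u << 24, "UNALIGNED"   },
    { 1u << 25, "DIVBYZERO"   },
};

/* Store the names of all set flags in names[] (up to max); return the count. */
size_t cfsr_decode(uint32_t cfsr, const char **names, size_t max)
{
    size_t n = 0;
    for (size_t i = 0; i < sizeof cfsr_flags / sizeof cfsr_flags[0] && n < max; i++) {
        if (cfsr & cfsr_flags[i].mask)
            names[n++] = cfsr_flags[i].name;
    }
    return n;
}

/* BFAR and MMFAR are only meaningful while their VALID bits are set. */
int cfsr_bfar_valid(uint32_t cfsr)  { return (int)((cfsr >> 15) & 1u); }
int cfsr_mmfar_valid(uint32_t cfsr) { return (int)((cfsr >> 7) & 1u); }
```

A handler would read SCB->CFSR once, print the decoded names over RTT, and print BFAR or MMFAR only when the matching valid bit is set.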

Minimum CI setup

  1. Raspberry Pi 4 + CMSIS-DAP probe ($25 total).
  2. pyOCD flashes firmware, runs pytest-embedded.
  3. Test firmware uses semihosting for assertion results.
  4. Pass/fail propagates to GitHub Actions via exit code.

Total: < $40 of hardware per rack slot; catches regressions the instant they land.

24

References

Arm: CoreSight Architecture Specification (ARM IHI 0029)
Arm: Arm Debug Interface (ADIv5, ADIv6) specifications (IHI 0031)
Arm: Cortex-M3 / M4 / M7 Technical Reference Manuals, chapters on DWT, ITM, FPB, ETM
Arm: Armv8-M Debug Architecture extensions, Authenticated Debug (ADAC)
Joseph Yiu: Definitive Guide to Cortex-M3/M4, Chapter 14 (debug and trace architecture)
SEGGER: Application Note on Real-Time Transfer (RTT) and J-Link User Guide
pyOCD: Python CoreSight driver library, github.com/pyocd/pyOCD
OpenOCD: open-source debug, github.com/openocd-org/openocd (best reference source for CoreSight protocols)

Educational use.