ARM CORTEX-M · PRESENTATION 07

Low-Power Design

Sleep modes · WFI / WFE · WIC · Tickless RTOS · ULPMark
How a coin-cell lasts ten years
02

Why Low Power

  • Target devices often run from CR2032 (225 mAh) for 5–10 years, or harvest μW-scale energy from solar/vibration.
  • The CPU is frequently awake < 1% of the time — everything depends on low-power idle.
  • Cortex-M spec provides a small set of architectural primitives: WFI, WFE, SLEEPDEEP, SLEEPONEXIT.
  • Vendors layer their own power modes (STOP, STANDBY, LPM, SYSTEMOFF, HIBERNATE) on top of those primitives.
  • Interview angle: articulate the split between what Arm standardises and what the SoC vendor adds.

Typical targets (2024)

  • Active: 20–80 μA/MHz on 40 nm.
  • Sleep w/ SRAM retain: 1–5 μA.
  • Deep-sleep RTC-only: 0.3–2 μA.
  • Ship mode / full power-off: < 50 nA.
Ambiq Apollo4+ hits ~6 μA/MHz active via subthreshold operation; Nordic nRF54L sits around 40 μA/MHz at 128 MHz on 22 nm.
03

Where the Energy Goes

CPU dynamic
55%
Flash access
20%
Clock tree / PLL
10%
Peripherals (I/O & analog)
10%
Leakage / BOR / LDO
5%

Active-mode energy split — typical 40 nm Cortex-M4 at 64 MHz. Leakage dominates in sleep modes because dynamic power collapses.

Dynamic: CV²f

Lower V, lower f → quadratic + linear saving. Reducing core voltage from 1.2 V to 0.9 V cuts dynamic by ~44%.
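
The ~44% figure can be checked directly from the CV²f relation. A minimal sketch, assuming the switched capacitance C stays constant across operating points; the function name `dynamic_power_ratio` is illustrative and not part of any vendor SDK:

```c
/* Ratio of new to old dynamic power under P = C * V^2 * f,
 * assuming the switched capacitance C is unchanged. */
double dynamic_power_ratio(double v_old, double v_new,
                           double f_old, double f_new)
{
    double v_scale = v_new / v_old;   /* voltage enters quadratically */
    double f_scale = f_new / f_old;   /* frequency enters linearly */
    return v_scale * v_scale * f_scale;
}
```

For 1.2 V to 0.9 V at an unchanged clock the ratio is 0.75² ≈ 0.56, which is the roughly 44% saving quoted above.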

Leakage: f(V, T)

Rises roughly exponentially with temperature and faster than linearly with V. Sub-threshold designs (Ambiq) get huge wins at the cost of frequency.

04

The Core's Sleep Primitives

Instruction | Effect | Wakeup condition
WFI | Wait For Interrupt. Stop fetching. Core clock may gate. | Any pending IRQ with priority higher than the current execution priority (PRIMASK is ignored for wake-up), reset, debug halt.
WFE | Wait For Event. Same gating, but wakes on either an interrupt or a "SEV" event. | IRQ, debug, SEV from another thread / core, peripheral event.
SEV | Send Event. Sets the event register; a subsequent WFE returns immediately. | n/a

WFI vs WFE

WFE wakes from events as well as interrupts. That makes it the natural wait primitive for spinlocks built on LDREX/STREX (the holder issues SEV on release) and for RTOS idle loops that need to be woken without a real IRQ.

SLEEPONEXIT

Bit in SCB->SCR — on exception return to Thread mode, hardware automatically executes WFI. Useful for purely event-driven firmware: initialise, then live in ISRs forever, never waking in main().

05

Sleep vs Deep Sleep

Regular Sleep

  • SCB->SCR.SLEEPDEEP=0.
  • Core & NVIC clock gated; most peripherals keep running.
  • Wake latency: < 10 cycles.
  • Typical current: 500 μA – 5 mA depending on how much of the SoC stays up.

Deep Sleep

  • SCB->SCR.SLEEPDEEP=1.
  • Architecture only says "deeper" — the vendor defines what goes off (PLL, flash, analog, SRAM retention).
  • Wake latency: μs to ms.
  • Typical current: 0.3 – 10 μA.
void enter_deep_sleep(void)
{
    SCB->SCR |= SCB_SCR_SLEEPDEEP_Msk;
    __DSB();
    __WFI();
    /* on wake, clock tree restarts;
       code continues here */
}
The core instruction is the same WFI in both cases. The difference is entirely in what the SoC's power-management block does when it sees CPU_SLEEPING && SLEEPDEEP.
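
The two SCR bits discussed so far (SLEEPONEXIT and SLEEPDEEP) can be set together in one read-modify-write. A minimal sketch: the bit positions are the architectural ones, but the macro names and the helper `scr_select_sleep` are hypothetical, and the register is passed as a pointer purely so the logic can be exercised off-target.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural bit positions in the System Control Register (SCB->SCR). */
#define SCR_SLEEPONEXIT (1u << 1)  /* sleep again on return to Thread mode */
#define SCR_SLEEPDEEP   (1u << 2)  /* request deep sleep on WFI/WFE */

/* Hypothetical helper: select sleep depth and SLEEPONEXIT behaviour.
 * On target, pass &SCB->SCR, then issue __DSB() and __WFI(). */
void scr_select_sleep(volatile uint32_t *scr, bool deep, bool sleep_on_exit)
{
    uint32_t value = *scr;

    value &= ~(SCR_SLEEPDEEP | SCR_SLEEPONEXIT);
    if (deep)          value |= SCR_SLEEPDEEP;
    if (sleep_on_exit) value |= SCR_SLEEPONEXIT;

    *scr = value;   /* single write; all other SCR bits are preserved */
}
```

Clearing both bits first means a later request for regular Sleep cannot inherit a stale SLEEPDEEP from an earlier deep-sleep entry.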
06

The Wake-up Interrupt Controller

[Figure: always-on IRQ sources (RTC, GPIO EXTI, LPUART, low-power comparator) feed the WIC (clocked at 32 kHz, ~100 nA), which sends a wake-up request to the Core + NVIC block (gated in deep sleep, 0 mA).]
  • The WIC duplicates the NVIC's mask logic on a tiny always-on domain (~100 nA).
  • In deep sleep, the core and NVIC are clock-gated or power-gated, so their outputs are electrically undefined (X).
  • The WIC listens to the always-on IRQ lines; when any enabled line asserts, it raises WAKEUP to the PMU.
  • PMU restarts clocks → CPU wakes → NVIC re-captures the IRQ → handler runs normally.
  • Without the WIC, you cannot power-gate the NVIC and keep IRQ wake; getting down to sub-μA requires WIC support.
07

Retention

What can be retained in sleep

  • CPU state — GPRs, xPSR, SP, control — stored in a handful of retention flops on the always-on domain (auto-handled by the core in some sleep modes).
  • NVIC state — priorities, pending/enabled masks.
  • SRAM — "retention SRAM" banks can be kept powered while the rest is collapsed. STM32 typically has 1–4 retention banks configurable.
  • Backup registers — a handful of 32-bit words on the VBAT/RTC domain, persist across the CPU powering off entirely.

What you give up per level

Mode | CPU | SRAM | Flash | Typ. μA
Sleep | | | | 500–2000
Low-power sleep | ✓ (slow) | | off | 50–200
Stop (retained) | off | | off | 1–10
Standby | off | partial | off | 0.3–2
Shutdown | off | off | off | < 0.1
08

Tickless RTOS

  • Classical RTOS wakes every 1 ms via SysTick → 3.6 million wake-ups per hour, most of them useless.
  • Tickless: replace fixed SysTick with an RTC-driven timer. Scheduler programs the next needed wake directly.
  • Requires:
    • Idle callback: "sleep for up to N ticks".
    • Vendor RTC / low-power timer with a 32 kHz xtal.
    • On wake, correct the tick count by elapsed RTC time.
/* FreeRTOS: enable tickless */
#define configUSE_TICKLESS_IDLE          1
#define configEXPECTED_IDLE_TIME_BEFORE_SLEEP 2

/* Vendor overrides vPortSuppressTicksAndSleep
   to replace SysTick with RTC-WUT */
extern void vPortSuppressTicksAndSleep(TickType_t);

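/* Pseudo-flow in the port; rtc_* and enter_stop_mode() are port-specific helpers */
void vPortSuppressTicksAndSleep(TickType_t xExpected)
{
    __disable_irq();                 /* a pending IRQ still wakes WFI */
    if (eTaskConfirmSleepModeStatus() == eAbortSleep) {
        __enable_irq();              /* a task became ready: do not sleep */
        return;
    }
    rtc_program_wake(xExpected);
    enter_stop_mode();               /* SLEEPDEEP + WFI */
    TickType_t slept = rtc_elapsed_and_clear();
    if (slept > xExpected) slept = xExpected;   /* never step past the request */
    vTaskStepTick(slept);
    __enable_irq();                  /* pending handlers run from here */
}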
Tickless RTOS routinely cuts idle power by 10–30× for event-driven workloads. It is the single highest-impact firmware change in an IoT product.
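
The "correct the tick count by elapsed RTC time" step is mostly unit conversion. A minimal sketch, assuming a 32.768 kHz RTC counter; `rtc_counts_to_ticks` and `RTC_HZ` are hypothetical names, and the result would be what a port passes to `vTaskStepTick()`:

```c
#include <stdint.h>

#define RTC_HZ 32768u   /* assumed 32.768 kHz low-speed crystal */

/* Convert elapsed RTC counts into whole RTOS ticks.
 * *carry keeps the sub-tick remainder between calls so that many
 * short sleeps do not accumulate rounding error. */
uint32_t rtc_counts_to_ticks(uint32_t counts, uint32_t tick_hz, uint32_t *carry)
{
    /* 64-bit intermediate avoids overflow on long sleeps */
    uint64_t scaled = (uint64_t)counts * tick_hz + *carry;

    *carry = (uint32_t)(scaled % RTC_HZ);
    return (uint32_t)(scaled / RTC_HZ);
}
```

With a 1 kHz tick, 33 RTC counts is slightly more than 1 ms; the carry makes sure the extra fraction eventually shows up as a whole tick rather than being dropped on every wake.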
09

DVFS on Cortex-M

  • Cortex-M itself has no DVFS table — frequency/voltage scaling is entirely the SoC's job.
  • Typical scheme:
    • Boot at slow HSI (internal oscillator, e.g. 16 MHz).
    • Enable PLL → run at 80 / 100 / 168 MHz during active work.
    • Drop back to HSI before WFI.
    • Scale Vcore via a vendor VOS register (STM32: PWR_VOS0..VOS3).
  • Changing voltage is slow (ms). Changing clock source is fast (μs). The ratio drives most workflows.

Race to sleep vs race to idle

  • Race to sleep — fastest feasible active, then deep sleep. Wins for burst-y workloads on newer silicon where active power ≈ linear in f and leakage dominates sleep.
  • Race to idle (at lower f): run at the slowest clock that still hits the deadline and avoid raising Vcore. Wins on older silicon where dynamic power dominates and sleep current is only mediocre.
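
Which strategy wins for a given part can be estimated from the charge each operating point spends above the sleep floor. A minimal sketch under stated assumptions: the struct, the function name and any numbers fed into it are illustrative, and wake-up overhead (PLL lock, voltage settling) is lumped into one figure.

```c
/* One candidate operating point for a fixed job. */
struct op_point {
    double work_ms;      /* time to finish the job at this clock */
    double overhead_ms;  /* PLL lock / voltage settle before work starts */
    double active_ua;    /* current while awake, in microamps */
};

/* Charge spent per job above the sleep floor, in microcoulombs.
 * Over a fixed period the total is I_sleep*T + (I_active - I_sleep)*t_awake,
 * so only the second term differs between operating points.
 * uA * ms = nC, hence the division by 1000. */
double net_charge_uc(struct op_point p, double sleep_ua)
{
    double awake_ms = p.work_ms + p.overhead_ms;
    return (p.active_ua - sleep_ua) * awake_ms / 1000.0;
}
```

The operating point with the smaller `net_charge_uc` gives the lower average current. When active current grows more slowly than clock speed the fast point wins; when it grows faster, the slow point does.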
10

Clock Gating by Peripheral

  • Every Cortex-M SoC has "RCC / CMU / CKG" registers — one enable bit per peripheral clock branch.
  • A disabled peripheral draws leakage only — sometimes a couple of nA.
  • Ordering matters:
    1. Enable the peripheral clock before touching the peripheral's registers.
    2. Accessing registers while the clock is gated gives vendor-defined behaviour (STM32: reads RAZ, writes WI until the clock is re-enabled).
  • In RTOS, attach a "peripheral use-count" on each clock branch: enable on first user, disable on last.
/* STM32H7: enable, then disable, the USART2 clock */
RCC->APB1LENR |=  RCC_APB1LENR_USART2EN;
__DSB();
/* … work with UART2 … */
RCC->APB1LENR &= ~RCC_APB1LENR_USART2EN;

/* Nordic nRF52: no per-peripheral clock bit; disabling the peripheral lets the PMU gate it */
NRF_UARTE0->ENABLE = UARTE_ENABLE_ENABLE_Disabled << UARTE_ENABLE_ENABLE_Pos;

The clock-gate bit is not the same as the peripheral-enable bit. Both are usually necessary.
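
The "peripheral use-count" idea from the list above can be sketched as a small reference-counted gate. This is an illustration, not a vendor API: `struct clk_gate`, `clk_get` and `clk_put` are hypothetical names, and the enable register is passed as a pointer (for example `&RCC->APB1LENR` on an STM32H7) so the logic can be tested off-target.

```c
#include <stdint.h>

/* Hypothetical reference-counted clock gate for one peripheral branch. */
struct clk_gate {
    volatile uint32_t *enable_reg;  /* vendor clock-enable register */
    uint32_t           mask;        /* this peripheral's enable bit */
    uint32_t           users;       /* number of active users */
};

/* First user switches the clock on. */
void clk_get(struct clk_gate *g)
{
    if (g->users++ == 0) {
        *g->enable_reg |= g->mask;
    }
}

/* Last user switches the clock off again; extra calls are ignored. */
void clk_put(struct clk_gate *g)
{
    if (g->users > 0 && --g->users == 0) {
        *g->enable_reg &= ~g->mask;
    }
}
```

In an RTOS both calls would need to run inside a critical section, since the count and the register are shared between tasks; that is omitted here for brevity.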

11

Vendor Sleep Modes — STM32 Family

Mode | CPU clock | Flash | PLL | SRAM | Wake sources | Current (typ L5)
Run | on | on | on | on | | 1–10 mA
Sleep | off (core) | on | on | on | any IRQ | 500 μA – 2 mA
Low-power Run | MSI @ 2 MHz | on | off | on | any IRQ | 100–300 μA
STOP0 | off | retained | off | all | WIC-IRQs + WUTS | 60–100 μA
STOP1 | off | off | off | most | EXTI, RTC, LPUART | 1–3 μA
STOP2 | off | off | off | fewer banks | EXTI, RTC, LPUART | 300–900 nA
STANDBY | off | off | off | backup only | RTC/WKUP pin | 80–300 nA
SHUTDOWN | off | off | off | off | WKUP pin | < 50 nA

Each mode is entered via a specific sequence of PWR register writes, then SCB->SCR.SLEEPDEEP + WFI. Recovery paths differ: STANDBY resumes through the reset vector, while STOP resumes at the instruction after WFI.

12

Nordic nRF52 / nRF54 Model

  • System ON — normal running; idle via WFI. SoC automatically gates peripherals whose owner driver has disabled them. Typical: 5–20 μA with RTC + BLE PHY off, few kHz from the RTC.
  • System OFF — SoC powered down apart from RAM retention + RESET/WAKEUP pins. ~0.4 μA.
  • Constant-Latency submode — forces HFCLK running; avoids PLL restart jitter for audio or radio.
  • Radio (PHY + baseband) & TIMERs are part of the power-aware event-driven architecture — PPI (Programmable Peripheral Interconnect) lets them operate without CPU wake.

PPI / DPPI — "CPU not required"

Nordic MCUs can transfer data and trigger peripherals without waking the CPU. Example: low-power comparator → GPIOTE → TIMER → SAADC → RAM buffer, all in System ON, CPU asleep until the sample buffer is full.

This architecture is why Nordic's BLE peripheral current averages < 8 μA at 1 s connection intervals: the CPU is asleep most of the time.
13

Ambiq — Subthreshold Cortex-M

  • Ambiq Apollo family runs the Cortex-M core at Vcore ≈ 0.5 V — below the classical threshold voltage.
  • Requires custom SRAM & flash designed to operate at that voltage, plus variation-tolerant logic libraries.
  • Result: ~6 μA/MHz active (Apollo4 Plus, 2022), vs 40–80 μA/MHz for conventional silicon on the same process.
  • Clock typically 96–192 MHz — subthreshold imposes a speed ceiling.
  • Used in many fitness bands, smartwatches and hearing aids.

Why this is not universal

Subthreshold operation breaks a lot of assumptions the rest of the IP toolchain makes — needs custom memory compilers, variation-aware STA, different fab processes. Arm licenses the architectural core; Ambiq does the rest.

Observation: for a given architecture, power per MHz is the single biggest vendor differentiator. ST, NXP and Ambiq can ship the "same" Cortex-M4 and have 10× different μA/MHz.
14

EEMBC ULPMark & Benchmark Culture

  • ULPMark-CP (Core Profile): wakes the MCU once per second for a fixed burst of "light" work, then sleeps until the next period. The score is inversely proportional to the energy used per cycle.
  • ULPMark-PP (Peripheral Profile) — exercises analog and digital peripherals in a realistic IoT duty cycle.
  • SecureMark-TLS: EEMBC's separate security profile (AES, SHA, ECC primitives).
  • Vendor datasheets quote these heavily because pure μA/MHz is only half the story — how quickly you can get to sleep matters just as much.

Example numbers (2024)

MCU | ULPMark-CP
Apollo4 Plus (M4F) | ~1320
STM32U5 (M33) | ~410
nRF54L (M33) | ~370
STM32L5 (M33) | ~370
STM32L4 (M4) | ~215

Higher = better. Apollo4's subthreshold design dominates.

15

CoreMark / mA — Active Efficiency

  • Same CoreMark integer benchmark as everywhere, divided by measured active current.
  • Rewards fast-finishing cores with low μA/MHz.
  • An M7 at 480 MHz can beat a 32 MHz M0+ on CoreMark/mA: although it draws more current, it finishes the same work at least 15× sooner on clock ratio alone.

Lesson

If your workload is bursty and the platform supports fast sleep entry, faster cores can be lower-power. The classic "save μA by running slower" intuition is often backwards on modern silicon.

Bosch: shipping IMUs using an M33 @ 120 MHz rather than M0+ @ 16 MHz precisely because the M33 races through its Kalman filter in 50 μs and sleeps for the remaining 950 μs per 1-kHz tick.
16

Tickless Gotchas

1. Missed wake during critical section

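On Cortex-M, masking interrupts with PRIMASK (__disable_irq()) does not stop a pending interrupt from waking the core out of WFI; the handler simply runs once interrupts are re-enabled. Most FreeRTOS ports rely on this and call __disable_irq() before WFI. The real hazard is an IRQ that is disabled in the NVIC or masked by BASEPRI: it will not wake the core.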

2. RTC drift

32 kHz crystals drift ±20 ppm. Over a 10 s deep-sleep interval, that's up to 200 μs of error. Protocol timers (BLE, LoRa) must budget for this.
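
The drift budget is a one-line calculation worth keeping next to the protocol timers. A minimal sketch; the function name `drift_us` is illustrative:

```c
#include <stdint.h>

/* Worst-case timing error in microseconds for a sleep of sleep_ms
 * on a clock with a tolerance of ppm parts per million. */
uint32_t drift_us(uint32_t sleep_ms, uint32_t ppm)
{
    /* error = t * ppm / 1e6; with t in ms this is ms * ppm / 1000 in us */
    return (uint32_t)(((uint64_t)sleep_ms * ppm) / 1000u);
}
```

For the case above, `drift_us(10000, 20)` gives 200 μs. A receiver would typically open its window at least that much early to stay safe.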

3. Pending IRQ missed on sleep entry

Race: IRQ arrives just before WFI executes. v7-M architecture says WFI returns immediately if any IRQ is pending — but some SoCs latch the "about to sleep" state a cycle earlier. Use the "SEVONPEND" trick + WFE to guarantee.

4. DMA still running into deep sleep

Entering STOP while a DMA is programmed can abort it. Flush/abort DMAs before deep-sleep entry.

5. Flash wake-up latency

Flash macros take tens of μs to restart. First ISR after wake stalls in flash-wait. Move critical handlers into RAM (.ramfunc) — every vendor has a macro for this.

6. SysTick still running

Classic: tickless mode entered, but SCB->SCR.SLEEPDEEP left at 0, so you get Sleep rather than Stop and SysTick keeps ticking. Check with a current probe. Always.

17

Debug While Sleeping

  • Entering deep sleep gates the DAP clock — debugger link drops.
  • Vendor "debug-monitor" bits keep the DAP alive in sleep modes for development:
    • STM32: DBGMCU_CR.DBG_STOP = 1, DBG_STANDBY = 1.
    • Nordic: POWER_DEBUG.ENABLE.
    • NXP: DEBUG_SAFE bit.
  • Note: enabling these disables the sleep power reduction. Power measurements must be done with debug-in-sleep disabled.

Typical bring-up rhythm

  1. Enable debug-in-sleep. Develop the firmware flow.
  2. Disable debug-in-sleep. Take power measurements (Joulescope, N6705, Otii).
  3. Iterate. Confirm with a power profile that every non-active cycle is in the lowest acceptable mode.
  4. For returned field units: you cannot debug a running unit that has DBG_STOP=0. Plan for logged telemetry over RTT or a disposable dev-mode flag.
18

Event-Driven Architecture

[Figure: event-driven flow. Hardware event source (RTC tick / BLE event / sensor) → WIC asserts wake → CPU wakes → ISR handles quickly and posts → SLEEPONEXIT auto-sleep.]
No polling. No busy-wait. main() is a single WFI. With SLEEPONEXIT=1, hardware executes WFI automatically on ISR return.
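
The "handle quickly and post" step is often a small single-producer, single-consumer queue: the ISR records an event code and returns, and a lower-priority handler or the main loop drains the queue before sleeping again. A minimal sketch, assuming one producer and one consumer on a single core; all names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define EVT_QUEUE_SIZE 8u   /* must be a power of two */

/* Single-producer (ISR) / single-consumer event queue. */
struct evt_queue {
    volatile uint32_t head;                  /* written by the ISR only */
    volatile uint32_t tail;                  /* written by the consumer only */
    volatile uint8_t  slots[EVT_QUEUE_SIZE];
};

/* Called from an ISR: record the event and return quickly. */
bool evt_post(struct evt_queue *q, uint8_t evt)
{
    if (q->head - q->tail == EVT_QUEUE_SIZE) {
        return false;                        /* full: caller decides what to drop */
    }
    q->slots[q->head % EVT_QUEUE_SIZE] = evt;
    q->head++;
    return true;
}

/* Called by the consumer after wake-up: fetch the next event, if any. */
bool evt_take(struct evt_queue *q, uint8_t *evt)
{
    if (q->head == q->tail) {
        return false;                        /* empty: safe to sleep again */
    }
    *evt = q->slots[q->tail % EVT_QUEUE_SIZE];
    q->tail++;
    return true;
}
```

Because each index has exactly one writer, no lock is needed on a single core; a multi-core design would additionally need memory barriers.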
19

Common Bad Patterns

1. Busy-wait delays

for (volatile int i=0; i<N; i++); holds the CPU 100% active. Replace with __WFI + RTC wake or vendor LPTIM.

2. Bit-banging protocols

Using GPIO in software for UART / SPI burns far more energy than the hardware peripheral. Only acceptable if the peripheral doesn't exist.

3. printf in ISR

At 115200 baud 8N1 each character takes ~87 μs, so a blocking printf of just 12 characters locks the CPU for ~1 ms. In a 1 kHz ISR that means 100% CPU; no sleep ever happens.
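
The ~1 ms figure follows from the frame length and baud rate. A minimal sketch of the arithmetic, assuming a blocking (polled) UART write; the function name `uart_tx_time_us` is illustrative:

```c
#include <stdint.h>

/* Time in microseconds to shift `bytes` characters out of a UART.
 * bits_per_frame is 10 for 8N1 (start bit + 8 data bits + stop bit). */
uint32_t uart_tx_time_us(uint32_t bytes, uint32_t baud, uint32_t bits_per_frame)
{
    uint64_t bits = (uint64_t)bytes * bits_per_frame;
    return (uint32_t)((bits * 1000000u) / baud);
}
```

Twelve characters at 115200 baud come to just over 1 ms, so even a short log line fills the whole period of a 1 kHz interrupt.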

4. Heap allocation in tight loops

malloc/free not only hurt determinism — they light up lots of logic. Pre-allocate; use static buffers.

5. Leaving peripherals enabled

ADC left powered, comparator left biased — each costs μA. Disable after every use and re-enable on need.

6. Long debug strings

Every ITM/RTT byte is CPU cycles. Production builds should log binary event codes and resolve with a tool post-hoc.

20

Measuring — The Tooling You Need

  • Joulescope JS110 / JS220: 15-bit resolution over a μA-to-A dynamic range; captures sleep-to-active transitions.
  • Keysight N6705 / N6785A — bench SMU with waveform logging.
  • Qoitech Otii / Nordic PPK2 — purpose-built MCU current profilers with UART side-channel.
  • DWT CYCCNT + a logic analyzer — cheap way to correlate firmware state with external measurement.

What to look for on the trace

  • Unexpected active-mode plateaus (should be flat sleep).
  • Frequent spikes at unintended rates (SysTick at 1 kHz that should be off).
  • Long wake windows (flash wait, PLL lock taking longer than expected).
  • Cycle-to-cycle asymmetry: the energy budget per cycle should be predictable, so cycles that differ need an explanation.
Correlate the current trace with ITM/RTT timestamps. Every unexplained μA is a TODO.
21

Battery-Life Math — Worked Example

Sensor wakes every 1 s, samples ADC for 5 ms, then deep-sleeps.

Phase | t | I
Active (48 MHz) | 5 ms | 8 mA
Deep sleep (RTC) | 995 ms | 2 μA

Average current:

I_avg = (0.005 × 8000 + 0.995 × 2) / 1.000
      = 40 + 1.99 μA
      ≈ 42 μA

CR2032 @ 225 mAh →

life = 225 mAh / 0.042 mA
     = 5357 h
     ≈ 7.4 months

If we cut the active time to 2 ms (faster clock, pre-triggered ADC via DMA + PPI):

I_avg = (0.002 × 8000 + 0.998 × 2) / 1
      ≈ 18 μA

life = 225 / 0.018 = 12500 h ≈ 1.4 years
Same MCU. Same sleep current. About 2.3× the battery life from shortening the active window.
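
The worked example generalises to two small functions. A minimal sketch, assuming an ideal cell (no self-discharge, pulse-load derating or temperature effects); the function names are illustrative:

```c
/* Average current in uA for a two-phase duty cycle. */
double duty_cycle_avg_ua(double active_ms, double active_ua,
                         double period_ms, double sleep_ua)
{
    double sleep_ms = period_ms - active_ms;
    return (active_ms * active_ua + sleep_ms * sleep_ua) / period_ms;
}

/* Ideal battery life in hours for a given capacity and average current. */
double battery_life_hours(double capacity_mah, double avg_ua)
{
    return capacity_mah / (avg_ua / 1000.0);   /* uA -> mA */
}
```

Feeding in 5 ms at 8000 μA over a 1000 ms period with 2 μA sleep reproduces the ≈ 42 μA and roughly 5360 h above; changing the active time to 2 ms gives ≈ 18 μA and about 12500 h.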
22

Interview Checklist

  • Explain the difference between WFI and WFE and when each is correct.
  • Describe the role of the WIC — why it's needed for sub-μA sleep.
  • Walk through a tickless RTOS sleep: programming RTC, entering STOP, recovering the tick count.
  • Justify "race-to-sleep" vs "race-to-idle" in two sentences, specific to a given SoC generation.
  • Give 3 reasons a power trace might show unexpected active μA you didn't intend.
  • Know which SoC registers freeze watchdogs & timers on halt (DBGMCU_CR / PMU).
  • Explain how SLEEPONEXIT interacts with main() — and why some codebases never use main() at all.
  • Describe how Nordic / STM32 / Ambiq each layer extra sleep states on top of the Cortex-M primitives.
  • Know EEMBC ULPMark-CP vs -PP — what's measured, why they differ.
  • Compute a CR2032 battery life from duty cycle & active current.
23

References

Arm: Cortex-M Low-Power Architectures, part of the Armv7-M / Armv8-M reference manuals
Arm: Low Power Features in Cortex-M0 / M0+ / M23 / M33 technical manuals
STMicro: AN5239 (STM32L5 low-power modes), AN4635 (STM32H7 low-power)
Nordic: nRF52/nRF53/nRF54 System Architecture chapters; PPI/DPPI reference
Ambiq: Apollo4 Plus datasheet; Ambiq subthreshold white papers
EEMBC: ULPMark Profile Specifications and benchmark browser at eembc.org
FreeRTOS: Low Power RTOS (Tickless Idle) documentation and reference ports
Klaus Finkenzeller (for RFID) & Adrian Wyatt (Nordic, BLE) talks on low-power protocol-stack design

Educational use.