Target devices often run from CR2032 (225 mAh) for 5–10 years, or harvest μW-scale energy from solar/vibration.
The CPU is frequently awake < 1% of the time — everything depends on low-power idle.
The Arm architecture (Armv6-M / v7-M / v8-M) provides a small set of primitives: the WFI and WFE instructions, plus the SLEEPDEEP and SLEEPONEXIT control bits.
Vendors layer their own power modes (STOP, STANDBY, LPM, SYSTEMOFF, HIBERNATE) on top of those primitives.
Interview angle: articulate the split between what Arm standardises and what the SoC vendor adds.
Typical targets (2024)
Active: 20–80 μA/MHz on 40 nm.
Sleep w/ SRAM retain: 1–5 μA.
Deep-sleep RTC-only: 0.3–2 μA.
Ship mode / full power-off: < 50 nA.
Ambiq Apollo4+ hits ~6 μA/MHz active via subthreshold operation; Nordic nRF54L sits around 40 μA/MHz at 128 MHz on 22 nm.
03
Where the Energy Goes
CPU dynamic: 55%
Flash access: 20%
Clock tree / PLL: 10%
Peripherals (I/O & analog): 10%
Leakage / BOR / LDO: 5%
Active-mode energy split — typical 40 nm Cortex-M4 at 64 MHz. Leakage dominates in sleep modes because dynamic power collapses.
Dynamic: CV²f
Lower V, lower f → quadratic + linear saving. Reducing core voltage from 1.2 V to 0.9 V cuts dynamic by ~44%.
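The 44% figure can be checked with a few lines of C. This is a minimal sketch that assumes switched capacitance stays constant between the two operating points; the function name is illustrative only.

```c
/* Ratio of new to old dynamic power under P = C * V^2 * f,
   assuming the switched capacitance C is unchanged. */
static double dynamic_power_ratio(double v_old, double v_new,
                                  double f_old, double f_new)
{
    return (v_new * v_new * f_new) / (v_old * v_old * f_old);
}
```

Going from 1.2 V to 0.9 V at the same clock gives a ratio of 0.5625, i.e. a saving of about 44%. Passing a lower `f_new` as well multiplies in the linear frequency term.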
Leakage: f(V, T)
Rises roughly exponentially with temperature and at least linearly with V. Subthreshold designs (Ambiq) get huge wins at the cost of maximum frequency.
04
The Core's Sleep Primitives
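| Instruction | Effect | Wakeup condition |
| --- | --- | --- |
| WFI | Wait-For-Interrupt. Stop fetching. Core clock may gate. | Any enabled IRQ with priority higher than the current execution priority (PRIMASK is ignored for wake-up), reset, debug entry request. |
| WFE | Wait-For-Event. Same gating, but wakes on either an interrupt or an event such as SEV. Returns immediately if the event register is already set. | IRQ, debug, SEV from another thread / core, peripheral event. |
| SEV | Send-Event. Sets the event register; a subsequent WFE returns immediately. | n/a |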
WFI vs WFE
WFE wakes on events as well as interrupts. That makes it the natural wait instruction for spin locks built on LDREX/STREX (the releasing side issues SEV) and for RTOS idle loops that want to wait on a condition without needing a real IRQ.
SLEEPONEXIT
Bit in SCB->SCR. On exception return to Thread mode, the core enters sleep as if WFI had been executed. Useful for purely event-driven firmware: initialise, then live in ISRs forever, never returning to main().
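A minimal sketch of arming and disarming sleep-on-exit follows. The `scb_stub` struct is a host-side stand-in so the snippet compiles without CMSIS; on target the vendor's CMSIS device header provides `SCB` and the mask macros. The helper names are hypothetical.

```c
#include <stdint.h>

/* Host-side stand-in so the sketch compiles without CMSIS. On target,
   delete this stub and include the vendor's CMSIS device header. */
typedef struct { volatile uint32_t SCR; } scb_stub_t;
static scb_stub_t scb_stub;
#define SCB                      (&scb_stub)
#define SCB_SCR_SLEEPONEXIT_Msk  (1UL << 1)  /* SCR bit 1 */
#define SCB_SCR_SLEEPDEEP_Msk    (1UL << 2)  /* SCR bit 2 */

/* Arm sleep-on-exit; a non-zero `deep` also selects deep sleep. */
static void sleep_on_exit_enable(int deep)
{
    if (deep)
        SCB->SCR |= SCB_SCR_SLEEPDEEP_Msk;
    else
        SCB->SCR &= ~SCB_SCR_SLEEPDEEP_Msk;
    SCB->SCR |= SCB_SCR_SLEEPONEXIT_Msk;
}

/* An ISR that has queued work for Thread mode clears the bit so that
   exception return falls through to main() instead of sleeping. */
static void sleep_on_exit_disable(void)
{
    SCB->SCR &= ~SCB_SCR_SLEEPONEXIT_Msk;
}
```

On target, main() would typically finish initialisation, call `sleep_on_exit_enable(0)`, and execute a single WFI; from then on every handler return puts the core straight back to sleep.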
05
Sleep vs Deep Sleep
Regular Sleep
SCB->SCR.SLEEPDEEP=0.
Core & NVIC clock gated; most peripherals keep running.
Wake latency: < 10 cycles.
Typical current: 500 μA – 5 mA depending on how much of the SoC stays up.
Deep Sleep
SCB->SCR.SLEEPDEEP=1.
Architecture only says "deeper" — the vendor defines what goes off (PLL, flash, analog, SRAM retention).
Wake latency: μs to ms.
Typical current: 0.3 – 10 μA.
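/* Needs the vendor's CMSIS device header for SCB, __DSB() and __WFI(). */
void enter_deep_sleep(void)
{
    SCB->SCR |= SCB_SCR_SLEEPDEEP_Msk;  /* next WFI requests deep sleep */
    __DSB();                            /* complete the SCR write before sleeping */
    __WFI();                            /* sleep until a wake-up interrupt */
    /* On wake the clock tree restarts and execution continues here. */
}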
The core instruction is the same — WFI. The difference is entirely in what the SoC's power-management block does when it sees CPU_SLEEPING && SLEEPDEEP.
06
The Wake-up Interrupt Controller
The WIC duplicates the NVIC's mask logic on a tiny always-on domain (~100 nA).
In deep sleep, the core and NVIC are clock-gated or power-gated, so their outputs cannot be relied on.
The WIC listens to the always-on IRQ lines; when any enabled (unmasked) line asserts, it raises WAKEUP to the PMU.
PMU restarts clocks → CPU wakes → NVIC re-captures the IRQ → handler runs normally.
Without the WIC, you cannot power-gate the NVIC and keep IRQ wake; getting down to sub-μA requires WIC support.
07
Retention
What can be retained in sleep
CPU state — GPRs, xPSR, SP, control — stored in a handful of retention flops on the always-on domain (auto-handled by the core in some sleep modes).
NVIC state — priorities, pending/enabled masks.
SRAM — "retention SRAM" banks can be kept powered while the rest is collapsed. STM32 typically has 1–4 retention banks configurable.
Backup registers — a handful of 32-bit words on the VBAT/RTC domain, persist across the CPU powering off entirely.
What you give up per level
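| Mode | CPU | SRAM | Flash | Typ. μA |
| --- | --- | --- | --- | --- |
| Sleep | ✓ | ✓ | ✓ | 500–2000 |
| Low-power sleep | ✓ (slow) | ✓ | off | 50–200 |
| Stop (retained) | off | ✓ | off | 1–10 |
| Standby | off | partial | off | 0.3–2 |
| Shutdown | off | off | off | < 0.1 |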
08
Tickless RTOS
A classical RTOS wakes every 1 ms via SysTick → 1000 wake-ups per second (3.6 million per hour), most of them useless.
Tickless: replace fixed SysTick with an RTC-driven timer. Scheduler programs the next needed wake directly.
Requires:
Idle callback: "sleep for up to N ticks".
Vendor RTC / low-power timer with a 32 kHz xtal.
On wake, correct the tick count by elapsed RTC time.
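/* FreeRTOSConfig.h: enable tickless idle */
#define configUSE_TICKLESS_IDLE               1
#define configEXPECTED_IDLE_TIME_BEFORE_SLEEP 2

/* The Cortex-M ports declare vPortSuppressTicksAndSleep() weak, so a vendor
   port can override it to replace SysTick with the RTC wake-up timer (WUT). */
void vPortSuppressTicksAndSleep(TickType_t xExpected);

/* Pseudo-flow in the port. rtc_program_wake(), enter_stop_mode() and
   rtc_elapsed_and_clear() stand for vendor-specific helpers. */
void vPortSuppressTicksAndSleep(TickType_t xExpected)
{
    __disable_irq();                        /* PRIMASK set; a pending IRQ still wakes WFI */
    if (eTaskConfirmSleepModeStatus() == eAbortSleep) {
        __enable_irq();                     /* a task became ready: do not sleep */
        return;
    }
    rtc_program_wake(xExpected);
    enter_stop_mode();                      /* SLEEPDEEP + WFI */
    TickType_t slept = rtc_elapsed_and_clear();
    vTaskStepTick(slept);                   /* catch the kernel tick count up */
    __enable_irq();                         /* the waking ISR runs now */
}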
Tickless idle can cut idle power by 10–30× for event-driven workloads. It is often the single highest-impact firmware change in an IoT product.
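The "correct the tick count by elapsed RTC time" step is mostly unit conversion. The sketch below assumes a 32.768 kHz low-power timer and a 1 kHz kernel tick; both constants and the function name are illustrative, not taken from any particular port.

```c
#include <stdint.h>

#define RTC_HZ   32768u   /* assumed low-power timer clock */
#define TICK_HZ  1000u    /* assumed kernel tick rate */

/* Convert elapsed RTC counts to whole RTOS ticks. The remainder (in units of
   counts * TICK_HZ) is kept in *carry so that sub-tick fractions are not
   lost across successive sleeps. */
static uint32_t rtc_counts_to_ticks(uint32_t counts, uint32_t *carry)
{
    uint64_t scaled = (uint64_t)counts * TICK_HZ + *carry;
    *carry = (uint32_t)(scaled % RTC_HZ);
    return (uint32_t)(scaled / RTC_HZ);
}
```

Carrying the remainder matters because 32768 is not a multiple of 1000: simply truncating on every wake would make the kernel clock run slow by a fraction of a tick per sleep.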
09
DVFS on Cortex-M
Cortex-M itself has no DVFS table — frequency/voltage scaling is entirely the SoC's job.
Typical scheme:
Boot at slow HSI (internal oscillator, e.g. 16 MHz).
Enable PLL → run at 80 / 100 / 168 MHz during active work.
Drop back to HSI before WFI.
Scale Vcore via a vendor VOS register (STM32: PWR_VOS0..VOS3).
Changing voltage is slow (ms). Changing clock source is fast (μs). That asymmetry shapes most schemes: switch clocks freely, change voltage rarely.
Race to sleep vs race to idle
Race to sleep: run at the fastest feasible clock, then drop into deep sleep. Wins for bursty workloads on newer silicon where active power is roughly linear in f and leakage dominates sleep.
Race to idle (at lower f): run at the slowest clock that still hits the deadline, which avoids stepping up to a higher core voltage. Wins on older silicon where dynamic power dominates and sleep current is mediocre.
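The trade-off can be explored with a simple model. This is a sketch under stated assumptions: active current is split into a frequency-independent part (regulator, flash, clock tree) and a part proportional to f, and the job is a fixed number of CPU cycles per period. All parameter names and the numbers in the usage note are hypothetical.

```c
/* Average current in uA over one period for a job of `cycles` CPU cycles.
   i_fixed_ua:   active current that does not scale with f
   ua_per_mhz:   active current per MHz at this voltage
   i_sleep_ua:   current while asleep
   period_us:    repetition period of the job */
static double job_avg_current_ua(double cycles, double f_mhz,
                                 double i_fixed_ua, double ua_per_mhz,
                                 double i_sleep_ua, double period_us)
{
    double t_active_us = cycles / f_mhz;            /* MHz == cycles per us */
    double i_active_ua = i_fixed_ua + ua_per_mhz * f_mhz;

    if (t_active_us > period_us)
        t_active_us = period_us;                    /* deadline missed: never sleeps */

    return (i_active_ua * t_active_us +
            i_sleep_ua * (period_us - t_active_us)) / period_us;
}
```

With 640 000 cycles per second, 500 μA fixed, 50 μA/MHz and 2 μA sleep, 64 MHz averages about 39 μA and 16 MHz about 54 μA, so racing to sleep wins. If the slower clock also allows a lower voltage (say 25 μA/MHz), 16 MHz drops to about 38 μA and the slower option wins instead.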
10
Clock Gating by Peripheral
Every Cortex-M SoC has "RCC / CMU / CKG" registers — one enable bit per peripheral clock branch.
A disabled peripheral draws leakage only — sometimes a couple of nA.
Pattern to watch for:
Clock-gate the peripheral.
Touch its registers afterwards → vendor-defined behaviour (STM32: reads return zero and writes are ignored until the clock is re-enabled).
In RTOS, attach a "peripheral use-count" on each clock branch: enable on first user, disable on last.
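/* STM32H7: enable, use, then disable the USART2 clock */
RCC->APB1LENR |= RCC_APB1LENR_USART2EN;
__DSB();                                   /* let the enable take effect first */
/* … work with USART2 … */
RCC->APB1LENR &= ~RCC_APB1LENR_USART2EN;
/* Nordic nRF52: no per-peripheral clock-gate bit; disabling the
   peripheral lets the SoC drop its clock automatically */
NRF_UARTE0->ENABLE = UARTE_ENABLE_ENABLE_Disabled << UARTE_ENABLE_ENABLE_Pos;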
The clock-gate bit is not the same as the peripheral-enable bit. Both are usually necessary.
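The "peripheral use-count" idea mentioned above can be sketched as a small reference counter per clock branch. `clk_enable_reg` is a stand-in for a vendor clock-enable register, and `clk_get` / `clk_put` are hypothetical names; in an RTOS both functions would need to run inside a critical section.

```c
#include <stdint.h>

/* Stand-in for a vendor clock-enable register; on target this would be
   the real memory-mapped register (for example an RCC enable word). */
static volatile uint32_t clk_enable_reg;

#define CLK_BRANCHES 32u
static uint8_t clk_users[CLK_BRANCHES];

/* First user of a branch turns its clock on. */
static void clk_get(unsigned branch)
{
    if (clk_users[branch]++ == 0)
        clk_enable_reg |= (1UL << branch);
}

/* Last user of a branch turns its clock off. Unbalanced calls are ignored. */
static void clk_put(unsigned branch)
{
    if (clk_users[branch] != 0 && --clk_users[branch] == 0)
        clk_enable_reg &= ~(1UL << branch);
}
```

Drivers then bracket every transaction with `clk_get()` / `clk_put()` and never touch the enable bit directly, so two drivers sharing a branch cannot gate the clock under each other.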
11
Vendor Sleep Modes — STM32 Family
| Mode | CPU clock | Flash | PLL | SRAM | Wake sources | Current (typ L5) |
| --- | --- | --- | --- | --- | --- | --- |
| Run | on | on | on | on | n/a | 1–10 mA |
| Sleep | off (core) | on | on | on | any IRQ | 500 μA – 2 mA |
| Low-power Run | MSI @ 2 MHz | on | off | on | any IRQ | 100–300 μA |
| STOP0 | off | retained | off | all | WIC-IRQs + WUTS | 60–100 μA |
| STOP1 | off | off | off | most | EXTI, RTC, LPUART | 1–3 μA |
| STOP2 | off | off | off | fewer banks | EXTI, RTC, LPUART | 300–900 nA |
| STANDBY | off | off | off | backup only | RTC/WKUP pin | 80–300 nA |
| SHUTDOWN | off | off | off | off | WKUP pin | < 50 nA |
Each STOP / STANDBY / SHUTDOWN mode is entered via a specific sequence of PWR register writes, then SCB->SCR.SLEEPDEEP + WFI. Recovery paths differ: STANDBY and SHUTDOWN wake through reset, while STOP resumes at the instruction after the WFI.
12
Nordic nRF52 / nRF54 Model
System ON — normal running; idle via WFI. SoC automatically gates peripherals whose owner driver has disabled them. Typical: 5–20 μA with RTC + BLE PHY off, few kHz from the RTC.
System OFF — SoC powered down apart from RAM retention + RESET/WAKEUP pins. ~0.4 μA.
Constant-Latency submode — forces HFCLK running; avoids PLL restart jitter for audio or radio.
Radio (PHY + baseband) & TIMERs are part of the power-aware event-driven architecture — PPI (Programmable Peripheral Interconnect) lets them operate without CPU wake.
PPI / DPPI — "CPU not required"
Nordic MCUs can transfer data and trigger peripherals without waking the CPU. Example: low-power comparator → GPIOTE → TIMER → SAADC → RAM buffer, all in System ON, CPU asleep until the sample buffer is full.
This architecture is why Nordic's BLE peripheral current averages < 8 μA at a 1 s connection interval: the CPU is off most of the time.
13
Ambiq — Subthreshold Cortex-M
Ambiq Apollo family runs the Cortex-M core at Vcore ≈ 0.5 V — below the classical threshold voltage.
Requires custom SRAM & flash designed to operate at that voltage, plus variation-tolerant logic libraries.
Result: ~6 μA/MHz active (Apollo4 Plus, 2022), vs 40–80 μA/MHz for conventional silicon on the same process.
Clock typically 96–192 MHz — subthreshold imposes a speed ceiling.
Used in many smartwatches, fitness bands and hearing aids.
Why this is not universal
Subthreshold operation breaks a lot of assumptions the rest of the IP toolchain makes — needs custom memory compilers, variation-aware STA, different fab processes. Arm licenses the architectural core; Ambiq does the rest.
Observation: for a given architecture, power per MHz is the single biggest vendor differentiator. ST, NXP and Ambiq can ship the "same" Cortex-M4 and have 10× different μA/MHz.
14
EEMBC ULPMark & Benchmark Culture
ULPMark-CP (Core Profile): wakes the MCU once per second for a fixed "light" workload, then sleeps until the next period. The score is based on the energy consumed per cycle.
ULPMark-PP (Peripheral Profile) — exercises analog and digital peripherals in a realistic IoT duty cycle.
Same CoreMark integer benchmark as everywhere, divided by measured active current.
Rewards fast-finishing cores with low μA/MHz.
An M7 at 480 MHz can beat a 32 MHz M0+ on CoreMark/mA: although it draws more current, it finishes the same work at least 15× sooner (15× the clock, plus more work per cycle).
Lesson
If your workload is bursty and the platform supports fast sleep entry, faster cores can be lower-power. The classic "save μA by running slower" intuition is often backwards on modern silicon.
Bosch: shipping IMUs using an M33 @ 120 MHz rather than M0+ @ 16 MHz precisely because the M33 races through its Kalman filter in 50 μs and sleeps for the remaining 950 μs per 1-kHz tick.
16
Tickless Gotchas
1. Missed wake during critical section
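Masking with PRIMASK (__disable_irq()) does not stop WFI from waking: a pending, enabled IRQ still wakes the core, and its handler runs once PRIMASK is cleared. FreeRTOS Cortex-M ports rely on exactly this. What does block the wake is an IRQ masked by BASEPRI or left disabled in the NVIC, so check both before sleeping.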
2. RTC drift
32 kHz crystals drift ±20 ppm. Over a 10 s deep-sleep interval, that's up to 200 μs of error. Protocol timers (BLE, LoRa) must budget for this.
3. Pending IRQ missed on sleep entry
Race: IRQ arrives just before WFI executes. v7-M architecture says WFI returns immediately if any IRQ is pending — but some SoCs latch the "about to sleep" state a cycle earlier. Use the "SEVONPEND" trick + WFE to guarantee.
4. DMA still running into deep sleep
Entering STOP while a DMA is programmed can abort it. Flush/abort DMAs before deep-sleep entry.
5. Flash wake-up latency
Flash macros take tens of μs to restart. First ISR after wake stalls in flash-wait. Move critical handlers into RAM (.ramfunc) — every vendor has a macro for this.
6. SysTick still running
Classic: tickless mode entered, but SCB->SCR.SLEEPDEEP left at 0 → you get Sleep, not Stop, and SysTick keeps counting. Check with a current probe. Always.
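The drift budget from gotcha 2 is a one-line calculation. A minimal sketch, assuming the tolerance is given in ppm and the result is rounded up to be conservative; the function name is illustrative.

```c
#include <stdint.h>

/* Worst-case timing error in microseconds after sleeping for sleep_ms
   on a clock whose tolerance is +/- ppm parts per million. */
static uint32_t drift_us(uint32_t sleep_ms, uint32_t ppm)
{
    /* sleep_ms * 1000 us * ppm / 1e6  ==  sleep_ms * ppm / 1000, rounded up */
    return (uint32_t)(((uint64_t)sleep_ms * ppm + 999u) / 1000u);
}
```

`drift_us(10000, 20)` returns 200, matching the figure above. A protocol stack would typically widen its listen window by this amount on each side of the expected event, and by more if the peer's clock drifts too.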
17
Debug While Sleeping
Entering deep sleep gates the DAP clock — debugger link drops.
Vendor debug-in-sleep bits keep the DAP alive in sleep modes for development:
STM32: DBGMCU_CR.DBG_STOP = 1, DBG_STANDBY = 1.
Nordic: POWER_DEBUG.ENABLE.
NXP: DEBUG_SAFE bit.
Note: with these bits set, debug clocks and power domains stay on, so the part never reaches its real sleep current. Power measurements must be done with debug-in-sleep disabled.
Typical bring-up rhythm
Enable debug-in-sleep. Develop the firmware flow.
Disable debug-in-sleep. Take power measurements (Joulescope, N6705, Otii).
Iterate. Confirm with a power profile that every non-active cycle is in the lowest acceptable mode.
For returned field units: you cannot debug a running unit that has DBG_STOP=0. Plan for logged telemetry over RTT or a disposable dev-mode flag.
18
Event-Driven Architecture
19
Common Bad Patterns
1. Busy-wait delays
for (volatile int i=0; i<N; i++); holds the CPU 100% active. Replace with __WFI + RTC wake or vendor LPTIM.
2. Bit-banging protocols
Using GPIO in software for UART / SPI burns far more energy than the hardware peripheral. Only acceptable if the peripheral doesn't exist.
3. printf in ISR
A single blocking printf of about a dozen characters on a 115200 8N1 UART holds the CPU for ~1 ms (10 bit times per character). In a 1 kHz ISR that means 100% CPU; no sleep ever happens.
4. Heap allocation in tight loops
malloc/free not only hurt determinism — they light up lots of logic. Pre-allocate; use static buffers.
5. Leaving peripherals enabled
ADC left powered, comparator left biased — each costs μA. Disable after every use and re-enable on need.
6. Long debug strings
Every ITM/RTT byte is CPU cycles. Production builds should log binary event codes and resolve with a tool post-hoc.
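The ~1 ms figure in pattern 3 follows from the frame format. A small sketch, assuming a blocking (polled) transmit and 8N1 framing; the function name is illustrative.

```c
#include <stdint.h>

/* Time in microseconds to shift out n_bytes on an 8N1 link:
   each byte costs 10 bit times (start + 8 data + stop). */
static uint32_t uart_tx_time_us(uint32_t n_bytes, uint32_t baud)
{
    return (uint32_t)(((uint64_t)n_bytes * 10u * 1000000u) / baud);
}
```

At 115200 baud one character takes about 87 μs, so a 12-character message already fills a 1 ms tick. The same arithmetic applies to pattern 6: every extra byte of log text is CPU-awake time unless a DMA channel drains the buffer.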
If we cut the active time to 2 ms (faster clock, pre-triggered ADC via DMA + PPI):
I_avg = (0.002 s × 8000 μA + 0.998 s × 2 μA) / 1 s ≈ 18 μA
life = 225 mAh / 0.018 mA = 12 500 h ≈ 1.4 years
Same MCU. Same sleep current. 2× battery life from shortening the active window.
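The same estimate can be wrapped in two helpers for quick what-if checks. This is a first-order sketch: it assumes the full rated capacity is usable and ignores self-discharge, temperature and pulse-load derating. Function names are illustrative.

```c
/* Average current in uA for a periodic duty cycle. */
static double duty_cycle_avg_ua(double t_active_s, double i_active_ua,
                                double i_sleep_ua, double period_s)
{
    return (t_active_s * i_active_ua +
            (period_s - t_active_s) * i_sleep_ua) / period_s;
}

/* Battery life in hours for a given capacity (mAh) and average current (uA). */
static double battery_life_hours(double capacity_mah, double i_avg_ua)
{
    return capacity_mah * 1000.0 / i_avg_ua;   /* mAh -> uAh, then divide by uA */
}
```

With 2 ms active at 8000 μA, 2 μA asleep and a 1 s period, the average is just under 18 μA and a 225 mAh CR2032 lasts roughly 12 500 hours, in line with the hand calculation above.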
22
Interview Checklist
Explain the difference between WFI and WFE and when each is correct.
Describe the role of the WIC — why it's needed for sub-μA sleep.
Walk through a tickless RTOS sleep: programming RTC, entering STOP, recovering the tick count.
Justify "race-to-sleep" vs "race-to-idle" in two sentences, specific to a given SoC generation.
Give 3 reasons a power trace might show unexpected active μA you didn't intend.
Know which SoC registers freeze watchdogs & timers on halt (DBGMCU_CR / PMU).
Explain how SLEEPONEXIT interacts with main() — and why some codebases never use main() at all.
Describe how Nordic / STM32 / Ambiq each layer extra sleep states on top of the Cortex-M primitives.
Know EEMBC ULPMark-CP vs -PP — what's measured, why they differ.
Compute a CR2032 battery life from duty cycle & active current.