ARM CORTEX-M · PRESENTATION 05

DSP, FPU & Helium (MVE)

Saturating Arithmetic · FPv4 / FPv5 · 128-bit Vector · CMSIS-DSP · CMSIS-NN
From M4 SIMD to M85 TinyML
02

Why Cortex-M Cares About Signal Processing

  • Modern MCUs run 48 kHz audio codecs, 250 kHz motor-control current loops, BLE radio baseband post-processing, always-on KWS (keyword spotting) NNs.
  • Until 2010, these needed either a second chip (TI C28x DSP, ADSP Blackfin) or a bigger A-profile.
  • Adding a targeted set of SIMD & saturating instructions let a Cortex-M4 displace all of that for the mid-range.
  • In 2020 Arm did it again with Helium (MVE) — putting 128-bit vector execution on an MCU for edge-ML.

Three tiers today

  • DSP extension (v7E-M / v8-M Main) — 32-bit SIMD on M4, M7, M33, M35P, M55, M85.
  • FPU — optional IEEE 754 single- and double-precision on M4 / M7 / M33 / M52 / M55 / M85.
  • Helium (v8.1-M MVE) — 128-bit beat-wise SIMD on M52, M55, M85. First Arm vector extension outside A-profile.
03

The DSP Extension — Packed-Operand SIMD

32-bit GPRs are already on the core. The DSP extension reinterprets them as packed vectors and adds instructions that operate on all lanes at once.

One 32-bit register, three packed views:
  • 32 bits: 1 × int32 (full word)
  • 2 × 16: int16 | int16 (halfword lanes)
  • 4 × 8: int8 | int8 | int8 | int8 (byte lanes)
  • No new register file — reuses R0–R12.
  • Adds ~85 instructions (counting signed/unsigned/halving variants): QADD8, QSUB8, SADD16, SSUB16, QADD16, SMLAD, SMLSD, SMMUL, SMMLA, SEL, USAT, SSAT, PKHBT, PKHTB, UXTAB, SXTB16, …
  • Saturating variants clip to type limits (±32767 for int16 etc.) instead of wrapping — and set the sticky APSR.Q flag.
  • Total cost: ~5 kgate in the ALU datapath.
04

SIMD Instructions — Reference

Instruction | Operation | Lanes | Typical use
SADD16 / SSUB16 | signed add / sub | 2 × int16 | FIR half-word addition
QADD16 / QSUB16 | saturating add / sub | 2 × int16 | Audio clipping-safe mix
SADD8 / QADD8 | add / saturating add | 4 × int8 | Image-processing blend
SMUAD / SMLAD / SMLSD | signed multiply (-accumulate) of two 16-bit pairs | 2 × int16 | FIR / IIR kernel inner loop
SMLAL / SMLALD | signed mult-acc, 64-bit accumulator | 2 × int16 | Large accumulators for long FIRs
SMMUL / SMMLA | signed 32 × 32 multiply, top 32 bits (-accumulate) | 1 × Q31 | Q31 fixed-point IIR
USAT / SSAT | saturate scalar to N bits | 1 × 32 | Post-multiply saturation to Q15 / Q7
SEL | byte-wise select using GE flags | 4 × int8 | Conditional pack after compare
PKHBT / PKHTB | pack bottom + top halfwords | 2 × int16 | Reassembling after shifted ops
UXTB16 / SXTB16 | unpack two bytes → two halfwords | 2 × int8 → int16 | Load 4 bytes, unpack for FIR
05

DSP Example — Q15 FIR Kernel

/* coeffs packed 2×Q15 per word;
   samples packed 2×Q15 per word.
   64-bit accumulator avoids overflow on long FIRs. */
q63_t acc = 0;
for (uint32_t i = 0; i < numTaps / 2; i++) {
    q31_t s = *(const q31_t *)(pS + 2*i);    /* s1 | s0 */
    q31_t c = *(const q31_t *)(pC + 2*i);    /* c1 | c0 */
    acc = __SMLALD(s, c, acc);               /* acc += s0*c0 + s1*c1 */
}
q15_t result = (q15_t)__SSAT((q31_t)(acc >> 15), 16);
  • One SMLALD per iteration performs two MACs in a single cycle.
  • Compare to scalar C:
    for (i=0;i<N;i++) acc += s[i]*c[i];
    which is 1 MAC per iteration + loop overhead.
  • Measured speed-up on STM32F4: ~3.5× for a 64-tap FIR.
  • CMSIS-DSP's arm_fir_q15 uses this pattern with further unrolling.
06

Saturating Arithmetic & the Q Flag

  • APSR.Q — a sticky bit set by any saturating op that actually clipped.
  • Reset only by writing 0 with MSR APSR_nzcvq.
  • Lets an algorithm check at end-of-frame whether any sample saturated — warn or reduce gain.
  • Hardware saturation is free (no conditional branch in the inner loop).
/* check overall saturation for a block */
__ASM volatile ("MSR APSR_nzcvq, %0" :: "r" (0));   /* clear sticky Q */
for (i = 0; i < N; i++)
    out[i] = __QADD16(a[i], b[i]);

if (__get_APSR() & (1U << 27))    /* Q flag */
    gain_control_reduce();
Audio DSP relies heavily on this pattern; sticky saturation flags have been standard in dedicated DSPs (such as the Motorola 56000 family) since the 1990s, and the Arm feature follows that tradition.
07

The FPU — FPv4 / FPv5

Variant | Cores | Precision | Registers | Notes
FPv4-SP | M4 | single (32-bit) | 32 × S | IEEE 754-2008; untrapped, sticky exception flags in FPSCR
FPv5-SP | M33, M7, M52, M55, M85 | single | 32 × S | same, plus VRINT / VSEL / VMINNM / VMAXNM
FPv5-DP | M7 option, M85 option | single + double | 32 × S, aliased as 16 × D | full IEEE 754 double arithmetic
FP16 (v8.1-M) | M55, M85 | adds half (16-bit) | same register file | half-precision lanes, used with Helium

Register set

  • S0–S31 (32 × 32-bit) or aliased as D0–D15 (16 × 64-bit).
  • Single FPSCR — rounding mode, condition flags, sticky exceptions.
  • Separate bank from integer R0–R12 — no pressure on GPRs.

Calling convention

  • Hard-float ABI: FP args in S0–S15 / D0–D7; return in S0 / D0.
  • Soft-float ABI: FP args in R0–R3 (uses integer regs) — what v6-M must use.
  • GCC flag: -mfloat-abi=hard -mfpu=fpv5-sp-d16.
08

FPU Context & Lazy Stacking

CPACR — enable FPU

/* Grant privileged + unpriv access to CP10/CP11 */
SCB->CPACR |= (3u << 20) | (3u << 22);
__DSB(); __ISB();

FPCCR — context behaviour

  • ASPEN (Automatic State Preservation Enable) — save FP regs on exception entry if FPCA=1.
  • LSPEN (Lazy Save Enable) — defer the actual writes until the handler touches the FPU.
  • FPCA in CONTROL — set by the first FP instr; cleared by EXC_RETURN with basic-frame bit.

Lazy stacking in a frame

  1. IRQ fires; CPU reserves a 26-word extended frame but writes only the basic 8 words.
  2. Handler runs scalar code — no FP state saved yet.
  3. If the handler executes vmul.f32 s0,s1,s2, the CPU stalls, writes S0–S15 and FPSCR into the reserved space, then continues.
  4. On return, the CPU restores the FP state only if it was actually written.
Gotcha: on FP-enabled builds, newer GCC/newlib may use FP registers inside library routines such as memcpy, silently triggering lazy stacking from a handler. Build the handler's TU with -mgeneral-regs-only to keep it FP-free.
09

Fixed-Point vs Floating-Point

 | Q15 / Q31 (DSP) | float32 (FPv4/5)
Range | −1 … +1 − 2⁻¹⁵ (or 2⁻³¹) | ±3.4 × 10³⁸
Precision | uniform absolute step (15 / 31 fraction bits) | 24-bit significand; constant relative precision, finer absolute precision near zero
Instruction cost | 1-cycle packed MAC | 1-cycle VMLA.F32
Memory | halfwords (FIR), packs well | words, may double buffer size
Determinism | exact, bit-reproducible | rounding depends on FPSCR mode; sticky exceptions
Good for | audio DSP, motor control, wireless | sensor fusion, Kalman filters, geometric compute, float-trained ML
Cortex-M4 with FPU often ends up doing both: critical inner loops in Q15/Q31 DSP, outer-loop housekeeping (gain, state vectors) in float — best of both worlds without two cores.
10

Helium (M-Profile Vector Extension)

  • Introduced with Armv8.1-M (2019). Shipping in Cortex-M52, M55, M85.
  • Eight 128-bit vector registers Q0–Q7, aliased onto the FPU register bank (Q0 = D0:D1 = S0–S3, and so on).
  • Lane configurations: 16 × int8, 8 × int16, 4 × int32, 8 × float16, 4 × float32.
  • Fused multiply-accumulate (FMA), multiply-accumulate with round, saturating, scatter-gather addressing.
  • Tail predication — a vector loop handles the non-multiple-of-lane remainder without a scalar tail.
  • Low-Overhead Branch (LOB) — DLS / WLS / LE: counted loops that execute with zero per-iteration branch overhead.

Design trade-off

NEON on A-profile is a full 128-bit datapath. Helium uses a beat-wise execution model: every 128-bit operation is split into four 32-bit beats, and an implementation may retire one, two, or four beats per cycle. A narrow dual-beat datapath costs a fraction of the silicon of a NEON block — while retaining the same instruction encoding and, thanks to overlapped beats, most of the peak throughput on tight loops.

Why the name?

Helium is the lightweight of the vector extensions. Compared to Arm's SVE / SVE2 (bigger, more dynamic), it's small & dense — like the element.

11

Helium — Beat-Wise Execution

[Diagram: a 4-lane vector instruction executes as four 32-bit beats; beats of consecutive instructions overlap in the pipeline, so while one instruction's final beat computes lane 3, the next instruction's first beat is already computing lane 0.]

Dual-beat implementation (M55)

The M55 retires two 32-bit beats per cycle, so a four-beat (128-bit) operation takes two cycles. Because beats of consecutive instructions overlap in the pipeline, back-to-back throughput on tight loops approaches one vector instruction per cycle.

Cortex-M85

Also dual-beat, but with a wider pipeline that can issue a vector and a scalar instruction in the same cycle — roughly 2× M55 throughput at equal clock.

12

Helium — Hello World

#include <arm_mve.h>        /* intrinsics */

void vadd_i16(const int16_t *a, const int16_t *b,
              int16_t *c, uint32_t n)
{
    while (n > 0) {
        mve_pred16_t p = vctp16q(n);              /* tail predicate */
        int16x8_t va = vld1q_z_s16(a, p);         /* load up to 8 */
        int16x8_t vb = vld1q_z_s16(b, p);
        int16x8_t vc = vaddq_x_s16(va, vb, p);    /* masked add */
        vst1q_p_s16(c, vc, p);                    /* masked store */
        a += 8; b += 8; c += 8;
        n  = n > 8 ? n - 8 : 0;
    }
}
vctp16q(n) builds a predicate "active if lane index < n". The predicate argument form (vaddq_x, vld1q_z) tells the CPU which lanes to compute / load. On the last iteration where n < 8, the CPU masks lanes automatically — no scalar tail loop needed.
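When porting such kernels it helps to keep a scalar reference with identical semantics for unit tests. A sketch (vaddq_x_s16 wraps on overflow, so the reference must too):

```c
#include <stdint.h>

/* Scalar reference for vadd_i16: same wrapping int16 addition,
   element for element -- for checking the Helium kernel against. */
void vadd_i16_ref(const int16_t *a, const int16_t *b,
                  int16_t *c, uint32_t n)
{
    for (uint32_t i = 0; i < n; i++)
        c[i] = (int16_t)(a[i] + b[i]);   /* wraps, like vaddq_x_s16 */
}
```

Run both over random buffers with awkward lengths (n = 1, 7, 8, 9, …) to exercise the tail-predication path specifically.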
13

Low-Overhead Loop Branch

; Classic loop
    MOV    r0, #N
1:  ; ... body ...
    SUBS   r0, r0, #1
    BNE    1b

; With LOB (Armv8.1-M)
    MOV    r0, #N
    DLS    lr, r0      ; LR = count; marks loop start
1:  ; ... body ...
    LE     lr, 1b      ; dec LR; branch if non-zero
  • DLS (Do-Loop-Start) or WLS (While-Loop-Start) initialises LR = count, records the branch target.
  • LE (Loop End) decrements LR; if non-zero, branches to the recorded target; if zero, falls through.
  • After the first pass the CPU has cached the loop-start address, so LE costs no branch-miss penalty and the loop carries no per-iteration overhead instructions.
  • Helium LETP / DLSTP variants also set up the tail predicate automatically.
14

Helium for TinyML

  • Quantised NN inference uses int8 activations + int8 weights + int32 accumulator.
  • Helium executes 16 × int8 multiply-accumulates per instruction (VMLADAV / VMLADAVA — "multiply-accumulate across vector" reductions).
  • Combined with 64-bit-accumulator variants (VMLALDAV) and scatter-gather loads for depthwise convs, CMSIS-NN achieves ~5× speed-up on M55 vs M4 for MobileNet-v1 kernels.
  • FP16 path: fast pre-post processing (resize, normalise, softmax) without falling back to float32.

Real numbers

Kernel | M4 DSP | M55 Helium | Speed-up
int8 conv2d 1×1 | 1.0× | ~4.8× | 4.8×
int8 depthwise 3×3 | 1.0× | ~4.2× | 4.2×
int8 FC matmul | 1.0× | ~5.6× | 5.6×
fp16 softmax (N=128) | 1.0× | ~3.1× | 3.1×

From Arm's CMSIS-NN v5 benchmarks on an Alif Ensemble eval board.

15

Helium Instruction Classes

Class | Examples | Purpose
Load / store | VLD1Q, VST1Q, VLD2/3/4 | contiguous and interleaved loads
Scatter / gather | VLDRB/H/W (vector-of-offsets form) | sparse indexing — LUTs, interpolation, hashing
Integer arithmetic | VADDQ, VSUBQ, VMULQ, VQADDQ (sat) | SIMD int8/16/32 math
Multiply-accumulate | VMLA, VMLADAV, VMLADAVA | FIR, GEMM, NN MAC
Float | VADD.F16/F32, VMLA.F16 | half / single-precision SIMD
Compare | VCMP (eq/ne/ge…, builds predicates) | masked execution
Shuffle / permute | VREV, VMOVN, VMOVL, VSHRN | data reshape, narrowing, widening
Reduction | VADDV, VMINAV, VMAXAV | scalar result from a vector
Predicate / loop | VPT / VPST, DLSTP / WLSTP / LETP | tail predication, LOB
16

CMSIS-DSP

  • Arm's portable DSP library in C.
  • Hand-tuned implementations per core: scalar / DSP-SIMD / Helium.
  • APIs unchanged across cores — only link-time library differs.
  • Covers:
    • Filters (FIR, IIR, biquad, LMS, normalised LMS)
    • Transforms (FFT radix-2/4/8, DCT)
    • Matrix ops (multiply, inverse, Cholesky)
    • Controller math (PID)
    • Statistics, distance functions
#include "arm_math.h"

arm_fir_instance_q15 fir;
q15_t taps[NUM_TAPS];
q15_t state[NUM_TAPS + BLOCK_SIZE - 1];

arm_fir_init_q15(&fir, NUM_TAPS, taps, state,
                 BLOCK_SIZE);

for (;;) {
    wait_for_dma_block(in);
    arm_fir_q15(&fir, in, out, BLOCK_SIZE);
    send_to_dac(out);
}

Same code runs on M0 (scalar), M4 (DSP SIMD), M7 (DSP + FPU), M55 (Helium) — just relink against the right libcmsisdsp.a.

17

CMSIS-NN

  • Companion library for int8 / int16 NN inference.
  • Kernels: fully-connected, conv2d, depthwise conv, softmax, pooling, batch-norm fusion, LSTM cell.
  • Maps to DSP SIMD on M4/M7/M33 and to Helium on M52/M55/M85.
  • The optimised kernel backend for TensorFlow Lite for Microcontrollers (tflite-micro) and for ExecuTorch on MCUs.

End-to-end speed

With CMSIS-NN + Helium on a 400 MHz Cortex-M55:

  • MobileNet-v1 0.25× @ 128×128 int8 → ~14 ms / frame.
  • Keyword-spotting DS-CNN → ~2.4 ms / inference at 100 MHz.
  • Person-detection MobileNetV1 → ~80 ms / frame at 400 MHz.

Compare to M4F (no Helium): same models are 4–6× slower.

18

Autovectorisation vs Intrinsics

Autovectorise

Arm Compiler 6 and GCC 13+ with -march=armv8.1-m.main+mve.fp can auto-vectorise simple loops. Good for portable code; rarely optimal for tight kernels.

Intrinsics (arm_mve.h)

Portable across compilers, close to the metal, easy to reason about. Preferred for DSP / NN kernels, but requires the programmer to know the target microarchitecture.

Inline assembly

Last resort — needed only when the intrinsic set doesn't expose a feature (rare) or for precise cycle-count tuning. Pin to a specific core.

Best practice: write the kernel with intrinsics (benchmark vs scalar); leave everything else in portable C and let the compiler auto-vec it. CMSIS-DSP uses this pattern pervasively.
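A loop shape that current auto-vectorisers handle well: restrict-qualified pointers (no aliasing), a countable trip count, branch-free body. The build flags in the comment are the ones named above; the function itself is plain portable C:

```c
#include <stdint.h>

/* Auto-vectorisation-friendly: restrict rules out aliasing, the trip
   count is a plain counter, and the body is straight-line arithmetic.
   Target build (assumption): -O3 -march=armv8.1-m.main+mve.fp */
void scale_add_s16(int16_t * restrict dst, const int16_t * restrict src,
                   int16_t k, uint32_t n)
{
    for (uint32_t i = 0; i < n; i++)
        dst[i] = (int16_t)(dst[i] + k * src[i]);
}
```

Drop the restrict qualifiers and the compiler must assume dst and src may overlap — and will usually fall back to scalar code.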
19

FPU Common Pitfalls

1. Forgot to enable CPACR

First FPU instruction → UsageFault (NOCP). SCB->CPACR |= (3u<<20)|(3u<<22); early in SystemInit.

2. Float in ISR without hardfp ABI

ISR compiled with softfp while library is hardfp → wrong register usage. Compile everything with the same -mfloat-abi.

3. VCVT between int and float in tight loop

VCVT.F32.S32 costs ~10 cycles on M4. Batch the converts — or stay in fixed-point.

4. Denormal floats

Denormal (subnormal) operands can be dramatically slower than normals on some FPU implementations, and DSP code almost never needs them. Set FPSCR.FZ=1 (flush-to-zero) for audio and control loops.

5. Shared volatile FP state

If two priorities use the FPU and only one enables ASPEN+LSPEN correctly, context leaks. Always use the CMSIS defaults.

6. Saturation without clearing Q

Any saturating op that clips leaves APSR.Q=1 until software clears it with an MSR APSR_nzcvq write. Stale sticky Q = false "clipping" alarms.

20

Helium Common Pitfalls

1. FPU not enabled

Helium shares the FP register file. CPACR CP10/CP11 must grant access before any MVE instruction.

2. Mismatched lane width in predicate

vctp16q builds a predicate for 8-lane int16 ops. Using it with an int8 op masks the wrong lanes.

3. Unaligned scatter-gather

Scatter-gather accesses at unaligned base may trap on Device memory and miss cache lines. Align data.

4. Tail predication vs cache coherence

On M55 with D-cache, a predicated store still touches a cache line. Pad DMA buffers to lane width.

5. Mixing SVE-style pragmas

Helium is not SVE. #pragma arm vector directives differ. Use __attribute__((target("mve.fp"))) for Helium.

6. Beat-wise ordering assumptions

Do not assume lane ordering in time. Architecturally, beats are observed in program order, but within a beat no ordering is guaranteed.

21

Comparative: M4 vs M7 vs M55 vs M85

Feature | M4 | M7 | M55 | M85
DSP extension | ✓ | ✓ | ✓ | ✓
FPU | FPv4-SP | FPv5-SP/DP | FPv5-SP + HP | FPv5-SP/DP + HP
Helium (MVE) | — | — | ✓ (dual-beat) | ✓ (dual-beat, dual-issue)
Cache | — | L1 I+D | opt I+D | opt I+D
Dual-issue | — | ✓ | — | ✓
Max clock | ~200 MHz | ~600 MHz | ~400 MHz | ~700 MHz
Typical FIR GMAC/s | 0.2 | 0.9 | 1.6 | 3.5
Typical int8 NN GOP/s | ~0.3 | ~1.0 | ~1.6 | ~3.0

Rough headline numbers; actual performance depends on memory system, cache config, and whether the code is auto-vectorised or hand-tuned.

22

When to Choose What

Stay with scalar Cortex-M0/M0+/M23

UI glue, low-bandwidth sensors, protocol stacks. No DSP needed.

Use Cortex-M4 DSP

Audio codecs up to 48 kHz stereo; 10–50 kHz motor control; light wireless DSP (CRC, scrambling). Great price.

Use Cortex-M7 + FPv5

Heavy FFT workloads, graphics compositing, bridging between DSP and MPU world. Cache + AXI earn their cost.

Use Cortex-M55

TinyML inference < 100 ms latency, multi-channel audio, camera pre-processing. First-generation Helium silicon (Alif, Renesas RA8).

Use Cortex-M85

ML + DSP + TrustZone in one core; highest-performance single-core MCU. PACBTI bonus for safety.

Consider a dedicated NPU

Arm Ethos-U55 / U65 pairs with M55/M85 for > 100 GOPS int8 — Helium handles non-convolution layers, NPU does conv.

23

Ecosystem & Tools

  • CMSIS-DSP / CMSIS-NN — Arm's portable libraries (github.com/ARM-software/CMSIS-DSP, CMSIS-NN).
  • TensorFlow Lite for Microcontrollers — uses CMSIS-NN as its reference MCU backend.
  • ExecuTorch on MCU — new PyTorch path using Helium + Ethos-U.
  • Arm Keil MDK (µVision) & Arm Compiler 6 — strongest auto-vec support for Helium today.
  • GCC 13+ / Clang 17+ — Helium intrinsics & auto-vec supported.
  • Corstone-300 / Corstone-310 — Arm reference subsystems containing M55 + Ethos-U55 for rapid SoC design.

DSP interview checklist

  • Explain the difference between SMLAD and SMLALD (32-bit vs 64-bit accumulator).
  • Describe when APSR.Q gets sticky.
  • Walk through the FPU register file and how it shares with Helium.
  • Explain tail-predication and why it makes vector loops branch-free.
  • Why lazy stacking (FPCCR.LSPEN) exists.
24

References

Arm — Cortex-M Helium Technology white paper (2019, updated 2023)
Arm — Helium Programmer's Guide (developer.arm.com/documentation/102102)
Arm — Cortex-M55 Software Optimization Guide; Cortex-M85 Software Optimization Guide
Arm — CMSIS-DSP & CMSIS-NN reference kernels (github.com/ARM-software/CMSIS-DSP, github.com/ARM-software/CMSIS-NN)
Hughes, Cameron — Digital Signal Processing with Arm Cortex-M Microcontrollers (Mouser/Arm, 2020)
Donoghue, J. — "Efficient quantized NN inference on Cortex-M55 with Helium" (Arm AI Virtual Tech Talks, 2022)
Warden, P. & Situnayake, D. — TinyML (O'Reilly, 2019) — baseline for the MCU inference field

Presentation built with Reveal.js 4.6 · Playfair Display + DM Sans + JetBrains Mono
Educational use.