ARM CORTEX-M · PRESENTATION 05

DSP, FPU & Helium (MVE)

Saturating Arithmetic · FPv4 / FPv5 · 128-bit Vector · CMSIS-DSP · CMSIS-NN
From M4 SIMD to M85 TinyML
02

Why Cortex-M Cares About Signal Processing

  • Modern MCUs run 48 kHz audio codecs, 250 kHz motor-control current loops, BLE radio baseband post-processing, always-on KWS (keyword spotting) NNs.
  • Until 2010, these needed either a second chip (TI C28x DSP, ADSP Blackfin) or a bigger A-profile.
  • Adding a targeted set of SIMD & saturating instructions let a Cortex-M4 displace all of that for the mid-range.
  • In 2020 Arm did it again with Helium (MVE) — putting 128-bit vector execution on an MCU for edge-ML.

Three tiers today

  • DSP extension (v7E-M / v8-M Main) — 32-bit SIMD on M4, M7, M33, M35P, M55, M85.
  • FPU — optional IEEE 754 single- and double-precision on M4 / M7 / M33 / M52 / M55 / M85.
  • Helium (v8.1-M MVE) — 128-bit beat-wise SIMD on M52, M55, M85. First Arm vector extension outside A-profile.
03

The DSP Extension — Packed-Operand SIMD

32-bit GPRs are already on the core. The DSP extension reinterprets them as packed vectors and adds instructions that operate on all lanes at once.

One 32-bit register, three packed views:
  • 32 bits: 1 × int32 (full word)
  • 2 × 16: int16 | int16 (halfword lanes)
  • 4 × 8: int8 | int8 | int8 | int8 (byte lanes)
  • No new register file — reuses R0–R12.
  • Adds ~85 instructions (counting signed/unsigned/halving variants): QADD8, QSUB8, SADD16, SSUB16, QADD16, SMLAD, SMLSD, SMMUL, SMMLA, SEL, USAT, SSAT, PKHBT, PKHTB, UXTAB, SXTB16, …
  • Saturating variants clip to type limits (±32767 for int16 etc.) instead of wrapping — and set the sticky APSR.Q flag.
  • Total cost: ~5 kgate in the ALU datapath.
04

SIMD Instructions — Reference

Instruction | Operation | Lanes | Typical use
SADD16 / SSUB16 | signed add / sub | 2 × int16 | FIR half-word addition
QADD16 / QSUB16 | saturating add / sub | 2 × int16 | Audio clipping-safe mix
SADD8 / QADD8 | add / saturating add | 4 × int8 | Image-processing blend
SMUAD / SMLAD / SMLSD | signed multiply (-accumulate) of two 16-bit pairs | 2 × int16 | FIR / IIR kernel inner loop
SMLAL / SMLALD | signed mult-acc, 64-bit accumulator | 2 × int16 | Large accumulators for long FIRs
SMMUL / SMMLA | signed 32 × 32 multiply, top 32 bits (-accumulate) | 1 × Q31 | Q31 fixed-point IIR
USAT / SSAT | saturate scalar to N bits | 1 × 32 | Post-multiply saturation to Q15 / Q7
SEL | byte-wise select using GE flags | 4 × int8 | Conditional pack after compare
PKHBT / PKHTB | pack bottom + top halfwords | 2 × int16 | Reassembling after shifted ops
UXTB16 / SXTB16 | unpack two bytes → two halfwords | 2 × int8 → int16 | Load 4 bytes, unpack for FIR
05

DSP Example — Q15 FIR Kernel

/* coeffs packed 2×Q15 per word;
   samples packed 2×Q15 per word.
   64-bit accumulator avoids overflow on long FIRs. */
q63_t acc = 0;
for (uint32_t i = 0; i < numTaps / 2; i++) {
    q31_t s = *(const q31_t *)(pS + 2*i);    /* s1 | s0 */
    q31_t c = *(const q31_t *)(pC + 2*i);    /* c1 | c0 */
    acc = __SMLALD(s, c, acc);               /* acc += s0*c0 + s1*c1 */
}
q15_t result = (q15_t)__SSAT((q31_t)(acc >> 15), 16);
  • One SMLALD per iteration performs two MACs in a single cycle.
  • Compare to scalar C:
    for (i=0;i<N;i++) acc += s[i]*c[i];
    which is 1 MAC per iteration + loop overhead.
  • Measured speed-up on STM32F4: ~3.5× for a 64-tap FIR.
  • CMSIS-DSP's arm_fir_q15 uses this pattern with further unrolling.
06

Saturating Arithmetic & the Q Flag

  • APSR.Q — a sticky bit set by any saturating op that actually clipped.
  • Reset only by writing 0 with MSR APSR_nzcvq.
  • Lets an algorithm check at end-of-frame whether any sample saturated — warn or reduce gain.
  • Hardware saturation is free (no conditional branch in the inner loop).
/* check overall saturation for a block */
__ASM volatile ("MSR APSR_nzcvq, %0" :: "r" (0));   /* clear sticky Q */
for (i = 0; i < N; i++)
    out[i] = __QADD16(a[i], b[i]);

if (__get_APSR() & (1U << 27))    /* Q flag */
    gain_control_reduce();
Audio DSP relies heavily on this pattern; sticky saturation flags have been standard in dedicated DSPs (such as the Motorola 56000 family) since the 1990s, and the Arm feature follows that tradition.
07

The FPU — FPv4 / FPv5

Variant | Cores | Precision | Registers | Notes
FPv4-SP | M4 | single (32-bit) | 32 × S | IEEE 754-2008; untrapped, sticky exception flags in FPSCR
FPv5-SP | M33, M7, M52, M55, M85 | single | 32 × S | same, plus VRINT / VSEL / VMINNM / VMAXNM
FPv5-DP | M7 option, M85 option | single + double | 32 × S, aliased as 16 × D | full IEEE 754 double arithmetic
FP16 (v8.1-M) | M55, M85 | adds half (16-bit) | same register file | half-precision lanes, used with Helium

Register set

  • S0–S31 (32 × 32-bit) or aliased as D0–D15 (16 × 64-bit).
  • Single FPSCR — rounding mode, condition flags, sticky exceptions.
  • Separate bank from integer R0–R12 — no pressure on GPRs.

Calling convention

  • Hard-float ABI: FP args in S0–S15 / D0–D7; return in S0 / D0.
  • Soft-float ABI: FP args in R0–R3 (uses integer regs) — what v6-M must use.
  • GCC flag: -mfloat-abi=hard -mfpu=fpv5-sp-d16.
08

FPU Context & Lazy Stacking

CPACR — enable FPU

/* Grant privileged + unpriv access to CP10/CP11 */
SCB->CPACR |= (3u << 20) | (3u << 22);
__DSB(); __ISB();

FPCCR — context behaviour

  • ASPEN (Automatic State Preservation Enable) — save FP regs on exception entry if FPCA=1.
  • LSPEN (Lazy Save Enable) — defer the actual writes until the handler touches the FPU.
  • FPCA in CONTROL — set by the first FP instr; cleared by EXC_RETURN with basic-frame bit.

Lazy stacking in a frame

  1. IRQ fires; CPU reserves a 26-word extended frame but writes only the basic 8 words.
  2. Handler runs scalar code — no FP state saved yet.
  3. If the handler executes vmul.f32 s0,s1,s2, the CPU stalls, writes S0–S15 and FPSCR into the reserved space, then continues.
  4. On return, the CPU restores the FP state only if it was actually written.
Gotcha: on FP-enabled builds, newer GCC/newlib may use FP registers inside library routines such as memcpy, silently triggering lazy stacking from a handler. Build the handler's TU with -mgeneral-regs-only to keep it FP-free.
09

Fixed-Point vs Floating-Point

 | Q15 / Q31 (DSP) | float32 (FPv4/5)
Range | −1 … +1 − 2⁻¹⁵ (or 2⁻³¹) | ±3.4 × 10³⁸
Precision | uniform absolute step (15 / 31 fraction bits) | 24-bit significand; constant relative precision, finer absolute precision near zero
Instruction cost | 1-cycle packed MAC | 1-cycle VMLA.F32
Memory | halfwords (FIR), packs well | words, may double buffer size
Determinism | exact, bit-reproducible | rounding depends on FPSCR mode; sticky exceptions
Good for | audio DSP, motor control, wireless | sensor fusion, Kalman filters, geometric compute, float-trained ML
Cortex-M4 with FPU often ends up doing both: critical inner loops in Q15/Q31 DSP, outer-loop housekeeping (gain, state vectors) in float — best of both worlds without two cores.
10

Helium (M-Profile Vector Extension)

  • Introduced with Armv8.1-M (2019). Shipping in Cortex-M52, M55, M85.
  • Eight 128-bit vector registers Q0–Q7, aliased onto the FPU register bank (Q0 = D0:D1 = S0–S3, and so on).
  • Lane configurations: 16 × int8, 8 × int16, 4 × int32, 8 × float16, 4 × float32.
  • Fused multiply-accumulate (FMA), multiply-accumulate with round, saturating, scatter-gather addressing.
  • Tail predication — a vector loop handles the non-multiple-of-lane remainder without a scalar tail.
  • Low-Overhead Branch (LOB) — DLS / WLS / LE: counted loops that execute with zero per-iteration branch overhead.

Design trade-off

NEON on A-profile is a full 128-bit datapath. Helium uses a beat-wise execution model: every 128-bit operation is split into four 32-bit beats, and an implementation may retire one, two, or four beats per cycle. A narrow dual-beat datapath costs a fraction of the silicon of a NEON block — while retaining the same instruction encoding and, thanks to overlapped beats, most of the peak throughput on tight loops.

Why the name?

Helium is the lightweight of the vector extensions. Compared to Arm's SVE / SVE2 (bigger, more dynamic), it's small & dense — like the element.

11

Helium — Beat-Wise Execution

[Diagram: a 4-lane vector instruction executes as four 32-bit beats; beats of consecutive instructions overlap in the pipeline, so while one instruction's final beat computes lane 3, the next instruction's first beat is already computing lane 0.]

Dual-beat implementation (M55)

The M55 retires two 32-bit beats per cycle, so a four-beat (128-bit) operation takes two cycles. Because beats of consecutive instructions overlap in the pipeline, back-to-back throughput on tight loops approaches one vector instruction per cycle.

Cortex-M85

Also dual-beat, but with a wider pipeline that can issue a vector and a scalar instruction in the same cycle — roughly 2× M55 throughput at equal clock.

12

Helium — Hello World

#include <arm_mve.h>        /* intrinsics */

void vadd_i16(const int16_t *a, const int16_t *b,
              int16_t *c, uint32_t n)
{
    while (n > 0) {
        mve_pred16_t p = vctp16q(n);              /* tail predicate */
        int16x8_t va = vld1q_z_s16(a, p);         /* load up to 8 */
        int16x8_t vb = vld1q_z_s16(b, p);
        int16x8_t vc = vaddq_x_s16(va, vb, p);    /* masked add */
        vst1q_p_s16(c, vc, p);                    /* masked store */
        a += 8; b += 8; c += 8;
        n  = n > 8 ? n - 8 : 0;
    }
}
vctp16q(n) builds a predicate "active if lane index < n". The predicate argument form (vaddq_x, vld1q_z) tells the CPU which lanes to compute / load. On the last iteration where n < 8, the CPU masks lanes automatically — no scalar tail loop needed.
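When porting such kernels it helps to keep a scalar reference with identical semantics for unit tests. A sketch (vaddq_x_s16 wraps on overflow, so the reference must too):

```c
#include <stdint.h>

/* Scalar reference for vadd_i16: same wrapping int16 addition,
   element for element -- for checking the Helium kernel against. */
void vadd_i16_ref(const int16_t *a, const int16_t *b,
                  int16_t *c, uint32_t n)
{
    for (uint32_t i = 0; i < n; i++)
        c[i] = (int16_t)(a[i] + b[i]);   /* wraps, like vaddq_x_s16 */
}
```

Run both over random buffers with awkward lengths (n = 1, 7, 8, 9, …) to exercise the tail-predication path specifically.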
13

Low-Overhead Loop Branch

; Classic loop
    MOV    r0, #N
1:  ; ... body ...
    SUBS   r0, r0, #1
    BNE    1b

; With LOB (Armv8.1-M)
    MOV    r0, #N
    DLS    lr, r0      ; LR = count; marks loop start
1:  ; ... body ...
    LE     lr, 1b      ; dec LR; branch if non-zero
  • DLS (Do-Loop-Start) or WLS (While-Loop-Start) initialises LR = count, records the branch target.
  • LE (Loop End) decrements LR; if non-zero, branches to the recorded target; if zero, falls through.
  • After the first pass the CPU has cached the loop-start address, so LE costs no branch-miss penalty and the loop carries no per-iteration overhead instructions.
  • Helium LETP / DLSTP variants also set up the tail predicate automatically.
14

Helium for TinyML

  • Quantised NN inference uses int8 activations + int8 weights + int32 accumulator.
  • Helium executes 16 × int8 multiply-accumulates per instruction (VMLADAV / VMLADAVA — "multiply-accumulate across vector" reductions).
  • Combined with 64-bit-accumulator variants (VMLALDAV) and scatter-gather loads for depthwise convs, CMSIS-NN achieves ~5× speed-up on M55 vs M4 for MobileNet-v1 kernels.
  • FP16 path: fast pre-post processing (resize, normalise, softmax) without falling back to float32.

Real numbers

Kernel | M4 DSP | M55 Helium | Speed-up
int8 conv2d 1×1 | 1.0× | ~4.8× | 4.8×
int8 depthwise 3×3 | 1.0× | ~4.2× | 4.2×
int8 FC matmul | 1.0× | ~5.6× | 5.6×
fp16 softmax (N=128) | 1.0× | ~3.1× | 3.1×

From Arm's CMSIS-NN v5 benchmarks on an Alif Ensemble eval board.

15

Helium Instruction Classes

Class | Examples | Purpose
Load / store | VLD1Q, VST1Q, VLD2/3/4 | contiguous and interleaved loads
Scatter / gather | VLDRB/H/W (vector-of-offsets form) | sparse indexing — LUTs, interpolation, hashing
Integer arithmetic | VADDQ, VSUBQ, VMULQ, VQADDQ (sat) | SIMD int8/16/32 math
Multiply-accumulate | VMLA, VMLADAV, VMLADAVA | FIR, GEMM, NN MAC
Float | VADD.F16/F32, VMLA.F16 | half / single-precision SIMD
Compare | VCMP (eq/ne/ge…, builds predicates) | masked execution
Shuffle / permute | VREV, VMOVN, VMOVL, VSHRN | data reshape, narrowing, widening
Reduction | VADDV, VMINAV, VMAXAV | scalar result from a vector
Predicate / loop | VPT / VPST, DLSTP / WLSTP / LETP | tail predication, LOB
16

CMSIS-DSP

  • Arm's portable DSP library in C.
  • Hand-tuned implementations per core: scalar / DSP-SIMD / Helium.
  • APIs unchanged across cores — only link-time library differs.
  • Covers:
    • Filters (FIR, IIR, biquad, LMS, normalised LMS)
    • Transforms (FFT radix-2/4/8, DCT)
    • Matrix ops (multiply, inverse, Cholesky)
    • Controller math (PID)
    • Statistics, distance functions
#include "arm_math.h"

arm_fir_instance_q15 fir;
q15_t taps[NUM_TAPS];
q15_t state[NUM_TAPS + BLOCK_SIZE - 1];

arm_fir_init_q15(&fir, NUM_TAPS, taps, state,
                 BLOCK_SIZE);

for (;;) {
    wait_for_dma_block(in);
    arm_fir_q15(&fir, in, out, BLOCK_SIZE);
    send_to_dac(out);
}

Same code runs on M0 (scalar), M4 (DSP SIMD), M7 (DSP + FPU), M55 (Helium) — just relink against the right libcmsisdsp.a.

17

CMSIS-NN

  • Companion library for int8 / int16 NN inference.
  • Kernels: fully-connected, conv2d, depthwise conv, softmax, pooling, batch-norm fusion, LSTM cell.
  • Maps to DSP SIMD on M4/M7/M33 and to Helium on M52/M55/M85.
  • The optimised kernel backend for TensorFlow Lite for Microcontrollers (tflite-micro) and for ExecuTorch on MCUs.

End-to-end speed

With CMSIS-NN + Helium on a 400 MHz Cortex-M55:

  • MobileNet-v1 0.25× @ 128×128 int8 → ~14 ms / frame.
  • Keyword-spotting DS-CNN → ~2.4 ms / inference at 100 MHz.
  • Person-detection MobileNetV1 → ~80 ms / frame at 400 MHz.

Compare to M4F (no Helium): same models are 4–6× slower.

18

Autovectorisation vs Intrinsics

Autovectorise

Arm Compiler 6 and GCC 13+ with -march=armv8.1-m.main+mve.fp can auto-vectorise simple loops. Good for portable code; rarely optimal for tight kernels.

Intrinsics (arm_mve.h)

Portable across compilers, close to the metal, easy to reason about. Preferred for DSP / NN kernels, but requires the programmer to know the target microarchitecture.

Inline assembly

Last resort — needed only when the intrinsic set doesn't expose a feature (rare) or for precise cycle-count tuning. Pin to a specific core.

Best practice: write the kernel with intrinsics (benchmark vs scalar); leave everything else in portable C and let the compiler auto-vec it. CMSIS-DSP uses this pattern pervasively.
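A loop shape that current auto-vectorisers handle well: restrict-qualified pointers (no aliasing), a countable trip count, branch-free body. The build flags in the comment are the ones named above; the function itself is plain portable C:

```c
#include <stdint.h>

/* Auto-vectorisation-friendly: restrict rules out aliasing, the trip
   count is a plain counter, and the body is straight-line arithmetic.
   Target build (assumption): -O3 -march=armv8.1-m.main+mve.fp */
void scale_add_s16(int16_t * restrict dst, const int16_t * restrict src,
                   int16_t k, uint32_t n)
{
    for (uint32_t i = 0; i < n; i++)
        dst[i] = (int16_t)(dst[i] + k * src[i]);
}
```

Drop the restrict qualifiers and the compiler must assume dst and src may overlap — and will usually fall back to scalar code.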
19

FPU Common Pitfalls

1. Forgot to enable CPACR

First FPU instruction → UsageFault (NOCP). SCB->CPACR |= (3u<<20)|(3u<<22); early in SystemInit.

2. Float in ISR without hardfp ABI

ISR compiled with softfp while library is hardfp → wrong register usage. Compile everything with the same -mfloat-abi.

3. VCVT between int and float in tight loop

VCVT.F32.S32 costs ~10 cycles on M4. Batch the converts — or stay in fixed-point.

4. Denormal floats

Denormal (subnormal) operands can be dramatically slower than normals on some FPU implementations, and DSP code almost never needs them. Set FPSCR.FZ=1 (flush-to-zero) for audio and control loops.

5. Shared volatile FP state

If two priorities use the FPU and only one enables ASPEN+LSPEN correctly, context leaks. Always use the CMSIS defaults.

6. Saturation without clearing Q

Any saturating op that clips leaves APSR.Q=1 until software clears it with an MSR APSR_nzcvq write. Stale sticky Q = false "clipping" alarms.

20

Helium Common Pitfalls

1. FPU not enabled

Helium shares the FP register file. CPACR CP10/CP11 must grant access before any MVE instruction.

2. Mismatched lane width in predicate

vctp16q builds a predicate for 8-lane int16 ops. Using it with an int8 op masks the wrong lanes.

3. Unaligned scatter-gather

Scatter-gather accesses at unaligned base may trap on Device memory and miss cache lines. Align data.

4. Tail predication vs cache coherence

On M55 with D-cache, a predicated store still touches a cache line. Pad DMA buffers to lane width.

5. Mixing SVE-style pragmas

Helium is not SVE. #pragma arm vector directives differ. Use __attribute__((target("mve.fp"))) for Helium.

6. Beat-wise ordering assumptions

Do not assume lane ordering in time. Architecturally, beats are observed in program order, but within a beat no ordering is guaranteed.

21

Comparative: M4 vs M7 vs M55 vs M85

Feature | M4 | M7 | M55 | M85
DSP extension | ✓ | ✓ | ✓ | ✓
FPU | FPv4-SP | FPv5-SP/DP | FPv5-SP + HP | FPv5-SP/DP + HP
Helium (MVE) | — | — | ✓ (dual-beat) | ✓ (dual-beat, dual-issue)
Cache | — | L1 I+D | opt I+D | opt I+D
Dual-issue | — | ✓ | — | ✓
Max clock | ~200 MHz | ~600 MHz | ~400 MHz | ~700 MHz
Typical FIR GMAC/s | 0.2 | 0.9 | 1.6 | 3.5
Typical int8 NN GOP/s | ~0.3 | ~1.0 | ~1.6 | ~3.0

Rough headline numbers; actual performance depends on memory system, cache config, and whether the code is auto-vectorised or hand-tuned.

22

When to Choose What

Stay with scalar Cortex-M0/M0+/M23

UI glue, low-bandwidth sensors, protocol stacks. No DSP needed.

Use Cortex-M4 DSP

Audio codecs up to 48 kHz stereo; 10–50 kHz motor control; light wireless DSP (CRC, scrambling). Great price.

Use Cortex-M7 + FPv5

Heavy FFT workloads, graphics compositing, bridging between DSP and MPU world. Cache + AXI earn their cost.

Use Cortex-M55

TinyML inference < 100 ms latency, multi-channel audio, camera pre-processing. First-generation Helium silicon (Alif, Renesas RA8).

Use Cortex-M85

ML + DSP + TrustZone in one core; highest-performance single-core MCU. PACBTI bonus for safety.

Consider a dedicated NPU

Arm Ethos-U55 / U65 pairs with M55/M85 for > 100 GOPS int8 — Helium handles non-convolution layers, NPU does conv.

23

Ecosystem & Tools

  • CMSIS-DSP / CMSIS-NN — Arm's portable libraries (github.com/ARM-software/CMSIS-DSP, CMSIS-NN).
  • TensorFlow Lite for Microcontrollers — uses CMSIS-NN as its reference MCU backend.
  • ExecuTorch on MCU — new PyTorch path using Helium + Ethos-U.
  • Arm Keil MDK (µVision) & Arm Compiler 6 — strongest auto-vec support for Helium today.
  • GCC 13+ / Clang 17+ — Helium intrinsics & auto-vec supported.
  • Corstone-300 / Corstone-310 — Arm reference subsystems containing M55 + Ethos-U55 for rapid SoC design.

DSP interview checklist

  • Explain the difference between SMLAD and SMLALD (32-bit vs 64-bit accumulator).
  • Describe when APSR.Q gets sticky.
  • Walk through the FPU register file and how it shares with Helium.
  • Explain tail-predication and why it makes vector loops branch-free.
  • Why lazy stacking (FPCCR.LSPEN) exists.
24

References

Arm — Cortex-M Helium Technology white paper (2019, updated 2023)
Arm — Helium Programmer's Guide (developer.arm.com/documentation/102102)
Arm — Cortex-M55 Software Optimization Guide; Cortex-M85 Software Optimization Guide
Arm — CMSIS-DSP & CMSIS-NN reference kernels (github.com/ARM-software/CMSIS-DSP, github.com/ARM-software/CMSIS-NN)
Hughes, Cameron — Digital Signal Processing with Arm Cortex-M Microcontrollers (Mouser/Arm, 2020)
Donoghue, J. — "Efficient quantized NN inference on Cortex-M55 with Helium" (Arm AI Virtual Tech Talks, 2022)
Warden, P. & Situnayake, D. — TinyML (O'Reilly, 2019) — baseline for the MCU inference field

Presentation built with Reveal.js 4.6 · Playfair Display + DM Sans + JetBrains Mono
Educational use.