32-bit GPRs are already on the core. The DSP extension reinterprets them as packed vectors and adds instructions that operate on all lanes at once.
Saturating variants set the sticky APSR.Q flag.

| Instruction | Operation | Lanes | Typical use |
|---|---|---|---|
| SADD16 / SSUB16 | signed add/sub | 2 × int16 | FIR half-word addition |
| QADD16 / QSUB16 | saturating add/sub | 2 × int16 | Audio clipping-safe mix |
| SADD8 / QADD8 | add / saturating add | 4 × int8 | Image-processing blend |
| SMUAD / SMLAD / SMLSD | signed dual multiply (-accumulate / -subtract) of two 16-bit pairs | 2 × int16 | FIR / IIR kernel inner loop |
| SMLAL / SMLALD | signed mult-acc, 64-bit accumulator | 2 × int16 | Large accumulators for long FIRs |
| SMMUL / SMMLA | signed 32×32 multiply, keep high 32 bits | 1 × Q31 | Q31 fixed-point IIR |
| USAT / SSAT | saturate scalar to N-bit | 1 × 32 | Post-multiply saturation to Q15 / Q7 |
| SEL | byte-wise select using GE flags | 4 × int8 | Conditional pack after compare |
| PKHBT / PKHTB | pack bottom+top halfwords | 2 × int16 | Reassembling after shifted ops |
| UXTB16 / SXTB16 | unpack two bytes → two halfwords | 2 × int8→int16 | Load 4 bytes, unpack for FIR |
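To make the packed-lane semantics concrete, here is a host-side C model of what QADD16 computes: two independent saturating int16 additions in one 32-bit word. The function names are ours for illustration, not a CMSIS API.

```c
#include <stdint.h>

/* Saturate a 32-bit intermediate to the int16 range. */
static int16_t sat16(int32_t x) {
    if (x >  32767) return  32767;
    if (x < -32768) return -32768;
    return (int16_t)x;
}

/* Model of QADD16: lane-wise saturating add of two packed int16 pairs. */
uint32_t qadd16_model(uint32_t a, uint32_t b) {
    int16_t lo = sat16((int32_t)(int16_t)(a & 0xFFFF) + (int16_t)(b & 0xFFFF));
    int16_t hi = sat16((int32_t)(int16_t)(a >> 16)    + (int16_t)(b >> 16));
    return ((uint32_t)(uint16_t)hi << 16) | (uint16_t)lo;
}
```

On the core, `__QADD16` does this in one cycle and sets APSR.Q if either lane clipped.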
/* coeffs packed 2×Q15 per word;
   samples packed 2×Q15 per word.
   __SMLALD keeps a 64-bit accumulator. */
q63_t acc = 0;
for (i = 0; i < numTaps / 2; i++) {
    q31_t s = *(q31_t *)(pS + 2*i);   /* s1 | s0 */
    q31_t c = *(q31_t *)(pC + 2*i);   /* c1 | c0 */
    acc = __SMLALD(s, c, acc);        /* acc += s0*c0 + s1*c1 */
}
q15_t y = (q15_t)__SSAT((q31_t)(acc >> 15), 16);
for (i=0;i<N;i++) acc += s[i]*c[i];
which is 1 MAC per iteration plus loop overhead. arm_fir_q15 uses the packed pattern above with further unrolling.

APSR.Q — a sticky bit, set by any saturating op that actually clipped and cleared only by writing APSR via MSR APSR_nzcvq.

/* check overall saturation for a block */
__ASM volatile ("MSR APSR_nzcvq, %0" :: "r" (0)); /* clear Q */
for (i=0; i<N; i++)
out[i] = __QADD16(a[i], b[i]);
if (__get_APSR() & (1U << 27)) /* Q flag */
gain_control_reduce();
| Variant | Cores | Precision | Registers | IEEE compliance |
|---|---|---|---|---|
| FPv4-SP | M4 | single (32-bit) | 32 × S | IEEE 754-2008, untrapped exceptions (sticky FPSCR flags) |
| FPv5-SP | M7, M33, M52, M55, M85 | single | 32 × S | Same, plus VRINT / VSEL / VMINNM-VMAXNM |
| FPv5-DP | M7 option, M85 option | single + double | 32 × S / 16 × D | Full IEEE 754 double |
| FPv5-HP | M55, M85 | + half (16-bit) | adds FP16 | Half-precision lanes, used with Helium |
Compile with -mfloat-abi=hard -mfpu=fpv5-sp-d16.

CPACR — enable the FPU before the first FP instruction:

/* Grant privileged + unprivileged access to CP10/CP11 */
SCB->CPACR |= (3u << 20) | (3u << 22);
__DSB(); __ISB();
FPCCR — context behaviour. With lazy stacking (the CMSIS default), exception entry only reserves stack space; on the handler's first FP instruction (e.g. vmul.f32 s0, s1, s2) the CPU stalls, writes the 18 FP words, then continues.

Watch out: even a handler that only calls memcpy can trigger the lazy save — newer GCC/libc may emit VMOV/VLDM on 8-byte-aligned buffers. Disable with -mno-unaligned-access, or build the handler's TU with -mfloat-abi=soft.
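For reference, this is the CMSIS default lazy-stacking setup, sketched with the standard CMSIS-Core FPU register definitions (the function name is ours; CMSIS startup normally does this for you):

```c
/* Sketch, assuming a CMSIS device header is in scope (e.g. core_cm4.h). */
void fpu_context_config(void)
{
    FPU->FPCCR |= FPU_FPCCR_ASPEN_Msk    /* auto FP state preservation */
                | FPU_FPCCR_LSPEN_Msk;   /* lazy save: defer the 18-word push */
    __DSB(); __ISB();                    /* commit before the next FP op */
}
```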
| | Q15 / Q31 (DSP) | float32 (FPv4/5) |
|---|---|---|
| Range | −1 … +1 − 2⁻¹⁵ or 2⁻³¹ | ±3.4×10³⁸ |
| Precision | Uniform absolute step (2⁻¹⁵ or 2⁻³¹) | 24-bit significand: constant relative precision, finer absolute precision near zero |
| Instruction cost | 1 cycle MAC (packed) | 1 cycle VMLA.F32 |
| Memory | halfwords (FIR), packs well | words, may double buffer size |
| Determinism | Exact, bit-reproducible | Rounding depends on FPSCR mode; sticky exceptions |
| Good for | Audio DSP, motor control, wireless | Sensor fusion, Kalman filter, geometric compute, float-trained ML |
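Converting between the two worlds is routine. A minimal sketch of the Q15 mapping — q = round(x · 2¹⁵), saturated — with helper names that are ours (CMSIS-DSP ships equivalents such as arm_float_to_q15):

```c
#include <stdint.h>

/* Float in [-1, +1) to Q15, with rounding and saturation. */
int16_t f32_to_q15(float x) {
    float s = x * 32768.0f;
    if (s >  32767.0f) return  32767;
    if (s < -32768.0f) return -32768;
    return (int16_t)(s + (s >= 0 ? 0.5f : -0.5f));  /* round half away */
}

/* Q15 back to float: exact, since 2^-15 steps are representable. */
float q15_to_f32(int16_t q) {
    return (float)q / 32768.0f;
}
```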
Eight 128-bit vector registers, Q0–Q7. Each is also accessible as D doublewords and S singlewords (aliased with the FPU register bank).

DLS/WLS/LE — low-overhead-branch instructions: a counted loop that compiles to zero-overhead branches.

NEON on A-profile is a full 128-bit datapath. Helium uses a beat-wise execution model: one 32-bit beat per cycle by default, so silicon area is roughly ¼ of a NEON block — while retaining most of the peak throughput on tight loops.
Helium is the lightweight of the vector extensions. Compared to Arm's SVE / SVE2 (bigger, more dynamic), it's small & dense — like the element.
One 32-bit beat per cycle → a 128-bit op takes 4 cycles. But the hardware overlaps beats of adjacent independent instructions, so back-to-back throughput approaches that of a full-width design.
Two beats per cycle → a 128-bit op in 2 cycles. M85 also dual-issues a vector and a scalar instruction per cycle — roughly 2× M55 throughput.
#include <arm_mve.h>                           /* MVE intrinsics */

void vadd_i16(const int16_t *a, const int16_t *b,
              int16_t *c, uint32_t n)
{
    while (n > 0) {
        mve_pred16_t p = vctp16q(n);           /* tail predicate */
        int16x8_t va = vld1q_z_s16(a, p);      /* load up to 8 */
        int16x8_t vb = vld1q_z_s16(b, p);
        int16x8_t vc = vaddq_x_s16(va, vb, p); /* masked add */
        vst1q_p_s16(c, vc, p);                 /* masked store */
        a += 8; b += 8; c += 8;
        n = n > 8 ? n - 8 : 0;
    }
}
vctp16q(n) builds a predicate "active if lane index < n". The predicate argument form (vaddq_x, vld1q_z) tells the CPU which lanes to compute / load. On the last iteration where n < 8, the CPU masks lanes automatically — no scalar tail loop needed.
; Classic loop
MOV r0, #N
1: ; ... body ...
SUBS r0, r0, #1
BNE 1b
; With LOB (Armv8.1-M)
MOV r0, #N
DLS lr, r0 ; LR = trip count; start of loop
1: ; ... body ...
LE lr, 1b ; dec LR; branch if not zero
LETP / DLSTP variants also set up the tail predicate automatically.

| Kernel | M4 DSP | M55 Helium | Speed-up |
|---|---|---|---|
| int8 conv2d 1×1 | 1.0× | ~4.8× | 4.8× |
| int8 depthwise 3×3 | 1.0× | ~4.2× | 4.2× |
| int8 FC matmul | 1.0× | ~5.6× | 5.6× |
| fp16 softmax (N=128) | 1.0× | ~3.1× | 3.1× |
From Arm's CMSIS-NN v5 benchmarks on an Alif Ensemble eval board.
| Class | Examples | Purpose |
|---|---|---|
| Load / Store | VLD1Q, VST1Q, VLD2/VLD4 | Contiguous and (de-)interleaved loads |
| Scatter / Gather | VLDRB/H/W.Q (vector-of-addresses) | Sparse indexing — LUTs, interpolation, hash |
| Integer arithmetic | VADDQ, VSUBQ, VMULQ, VQADDQ (sat) | SIMD int8/16/32 math |
| Multiply-accumulate | VMLA, VMLAV, VMLADAVA | FIR, GEMM, NN MAC |
| Float | VADD.F16/F32, VMLA.F16 | Half/float SIMD |
| Compare | VCMP, VCMPEQ, building predicates | Masked execution |
| Shuffle / Permute | VREV, VMOVN, VMOVL, VSHRN | Data reshape, narrowing, widening |
| Reduction | VADDV, VMINAV, VMAXAV | Scalar result from a vector |
| Predicate / loop | VPT / VPTT / LETP / WLSTP | Tail predication, LOB |
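To make the reduction class concrete, a host-side model of a predicated VMLADAV-style dot product — plain C mimicking the 8-lane blocking and the VCTP tail mask, not the intrinsic itself:

```c
#include <stdint.h>

/* Model of a predicated MVE reduction: accumulate a[i]*b[i] in
   8-lane blocks; the inner `if` plays the role of the VCTP.16
   predicate on the final, partial block. */
int32_t dot_q15_model(const int16_t *a, const int16_t *b, uint32_t n) {
    int32_t acc = 0;
    for (uint32_t base = 0; base < n; base += 8) {
        for (uint32_t lane = 0; lane < 8; lane++) {
            if (base + lane < n)               /* lane active? */
                acc += (int32_t)a[base + lane] * b[base + lane];
        }
    }
    return acc;
}
```

On an M55 the whole inner block is one vmladavaq_p_s16 per iteration, with no scalar tail loop.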
#include "arm_math.h"
arm_fir_instance_q15 fir;
q15_t taps[NUM_TAPS];
q15_t state[NUM_TAPS + BLOCK_SIZE - 1];
arm_fir_init_q15(&fir, NUM_TAPS, taps, state,
BLOCK_SIZE);
for (;;) {
wait_for_dma_block(in);
arm_fir_q15(&fir, in, out, BLOCK_SIZE);
send_to_dac(out);
}
Same code runs on M0 (scalar), M4 (DSP SIMD), M7 (DSP + FPU), M55 (Helium) — just relink against the right libcmsisdsp.a.
With CMSIS-NN + Helium on a 400 MHz Cortex-M55, the same models run 4–6× faster than on an M4F (no Helium).
Arm Compiler 6 and GCC 13+ with -march=armv8.1-m.main+mve.fp can auto-vectorise simple loops. Good for portable code; rarely optimal for tight kernels.
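A sketch of the loop shape those compilers vectorise cleanly with tail predication — plain C, no intrinsics; the function name is illustrative:

```c
#include <stdint.h>

/* Q15 gain: dst[i] = (src[i] * g) >> 15. Simple trip count, no
   early exit, no pointer aliasing tricks — exactly what
   -march=armv8.1-m.main+mve.fp -O2 turns into an MVE loop.
   Assumes arithmetic right shift for negative values (true on
   GCC/Clang targets). */
void scale_q15(int16_t *dst, const int16_t *src, int16_t g, uint32_t n)
{
    for (uint32_t i = 0; i < n; i++)
        dst[i] = (int16_t)(((int32_t)src[i] * g) >> 15);
}
```

Check the generated assembly (or -fopt-info-vec) to confirm the vector loop was actually emitted.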
Intrinsics (arm_mve.h) — portable across compilers, close to the metal, easy to reason about. Preferred for DSP / NN kernels; demands CPU knowledge from the programmer.
Hand-written assembly — last resort, needed only when the intrinsic set doesn't expose a feature (rare) or for precise cycle-count tuning. Pin it to a specific core.
First FPU instruction → UsageFault (NOCP). Fix: SCB->CPACR |= (3u << 20) | (3u << 22); early in SystemInit.
ISR compiled with softfp while library is hardfp → wrong register usage. Compile everything with the same -mfloat-abi.
VCVT.F32.S32 costs ~10 cycles on M4. Batch the converts — or stay in fixed-point.
Denormals on the FPU can be ~30× slower than normals. Set FPSCR.FZ=1 (flush-to-zero) for DSP code. Audio & control loops should always run FZ.
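Setting flush-to-zero takes one line with the standard CMSIS-Core intrinsics (__get_FPSCR/__set_FPSCR); FZ is FPSCR bit 24. The function name is ours:

```c
/* Sketch, assuming CMSIS-Core is in scope: enable flush-to-zero
   before entering the DSP hot loop, so denormal inputs/results
   are treated as zero instead of taking the slow path. */
void fpu_enable_flush_to_zero(void)
{
    __set_FPSCR(__get_FPSCR() | (1U << 24));  /* FPSCR.FZ = 1 */
}
```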
If two priorities use the FPU and only one enables ASPEN+LSPEN correctly, context leaks. Always use the CMSIS defaults.
Any saturating op leaves APSR.Q=1 until the flag is explicitly cleared via MSR APSR_nzcvq. Old code + sticky Q = false "clipping" alarms.
Helium shares the FP register file. CPACR CP10/CP11 must grant access before any MVE instruction.
vctp16q builds a predicate for 8-lane int16 ops. Using it with an int8 op masks the wrong lanes.
Scatter-gather accesses at unaligned base may trap on Device memory and miss cache lines. Align data.
On M55 with D-cache, a predicated store still touches a cache line. Pad DMA buffers to lane width.
Helium is not SVE. #pragma arm vector directives differ. Use __attribute__((target("mve.fp"))) for Helium.
Do not assume lane ordering in time. Architecturally, beats are observed in program order, but within a beat no ordering is guaranteed.
| Feature | M4 | M7 | M55 | M85 |
|---|---|---|---|---|
| DSP extension | ✓ | ✓ | ✓ | ✓ |
| FPU | FPv4-SP | FPv5-SP/DP | FPv5-SP + HP | FPv5-SP/DP + HP |
| Helium (MVE) | — | — | ✓ (1-beat) | ✓ (2-beat) |
| Cache | — | L1 I+D | opt I+D | opt I+D |
| Dual-issue | — | ✓ | — | ✓ |
| Max clock | ~200 MHz | 600 MHz | ~400 MHz | ~700 MHz |
| Typical FIR GMAC/s | 0.2 | 0.9 | 1.6 | 3.5 |
| Typical int8 NN GOP/s | ~0.3 | ~1.0 | ~1.6 | ~3.0 |
Rough headline numbers; actual performance depends on memory system, cache config, and whether the code is auto-vectorised or hand-tuned.
UI glue, low-bandwidth sensors, protocol stacks. No DSP needed.
Audio codecs up to 48 kHz stereo; 10–50 kHz motor control; light wireless DSP (CRC, scrambling). Great price.
Heavy FFT workloads, graphics compositing, bridging between DSP and MPU world. Cache + AXI earn their cost.
TinyML inference < 100 ms latency, multi-channel audio, camera pre-processing. First-generation Helium silicon (Alif, Renesas RA8).
ML + DSP + TrustZone in one core; highest-performance single-core MCU. PACBTI bonus for safety.
Arm Ethos-U55 / U65 pairs with M55/M85 for > 100 GOPS int8 — Helium handles non-convolution layers, NPU does conv.
Know the difference between SMLAD and SMLALD (32-bit vs 64-bit accumulator), and remember that APSR.Q is sticky.
Arm — Arm Cortex-M Helium Technology white paper (2019, updated 2023)
Arm — Helium Programmer's Guide (developer.arm.com/documentation/102102)
Arm — Cortex-M55 Software Optimization Guide; Cortex-M85 Optimization Guide
CMSIS-DSP & CMSIS-NN — github.com/ARM-software/CMSIS-DSP, github.com/ARM-software/CMSIS-NN — reference kernels
Cameron Hughes — Digital Signal Processing with Arm Cortex-M Microcontrollers (Mouser/Arm, 2020)
Donoghue, J. — "Efficient quantized NN inference on Cortex-M55 with Helium" (Arm AI Virtual Tech Talks, 2022)
Warden / Situnayake — TinyML (O'Reilly, 2019) — baseline for the MCU inference field
Presentation built with Reveal.js 4.6 · Playfair Display + DM Sans + JetBrains Mono
Educational use.