ARM CORTEX-A · PRESENTATION 04

Vector Extensions — NEON, SVE, SVE2, SME

Two decades of Arm SIMD · Fixed-width → Scalable → Matrix
NEON · SVE · SVE2 · SME · Bfloat16 · INT8 dot-product · Crypto · Fujitsu A64FX
02

Evolution of Arm SIMD

Year | Extension | Width | Shipping in | Key idea
2005 | NEON (v1) | 128-bit fixed | Cortex-A8 | First full Arm SIMD; integer + SP float
2011 | NEON + full DP | 128-bit fixed | Armv8-A (A53/A57) | Unified with FPU; 32 × 128-bit V-regs
2016 | FP16 + dot-product (SDOT/UDOT) | 128-bit fixed | Armv8.2-A, A75+ | INT8 × INT8 → INT32 accumulate; ML
2017 | SVE (optional) | 128 – 2048 bit | Fujitsu A64FX only | Vector-Length Agnostic; HPC
2020 | bfloat16 / INT8 matmul | 128-bit fixed | Armv8.6-A | ML training dtypes in NEON
2021 | SVE2 (mandatory) | 128 – 2048 bit | Armv9-A (X2, A710, A510) | SVE for mobile; replaces NEON for new code
2023 | SME / SME2 | SVL × SVL tiles | Armv9.2-A | Matrix outer-product; streaming mode
03

NEON — Advanced SIMD

  • 32 × 128-bit V-registers in AArch64 (V0–V31). In AArch32 they overlay as 32 × D-regs or 16 × Q-regs.
  • Each V-reg can be addressed as:
    • 16 × B (byte), 8 × H (halfword), 4 × S (word), 2 × D (doubleword)
    • 4 × FP32, 8 × FP16 (if FP16 feature), 2 × FP64
  • Lane-wise arithmetic: ADD V0.4S, V1.4S, V2.4S.
  • Saturating arithmetic, pairwise ops (ADDP, UADALP), cross-lane shuffles (EXT, UZP, ZIP, TRN, TBL).
  • Widening / narrowing pairs: UADDL (add, widen), ADDHN (add high narrow).
  • Crypto extension piggybacks on V-regs: AESE/AESD/AESMC, SHA1H, SHA256H, PMULL (carry-less 64 × 64 → 128-bit multiply).
// memcpy tail using NEON 128-bit loads
// (glibc-style AArch64 memcpy)

ldp   q0, q1, [x1, #-32]    // last 32 bytes
ldr   q2, [x1, #-48]        // prior 16
stp   q0, q1, [x0, #-32]
str   q2, [x0, #-48]

// SDOT (INT8 × INT8 → INT32 accumulate)
sdot  v3.4s, v4.16b, v5.16b
// per-lane: v3[i] += sum(v4[4i..4i+3] * v5[4i..4i+3])

// 4x4 matmul kernel (FP32) — one column of C (v0) per group of four FMLAs
fmla  v0.4s, v4.4s, v8.s[0]
fmla  v0.4s, v5.4s, v8.s[1]
fmla  v0.4s, v6.4s, v8.s[2]
fmla  v0.4s, v7.4s, v8.s[3]
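The per-lane SDOT comment above can be checked against a portable C model — a sketch, with the helper name sdot_ref being illustrative, not an Arm API:

```c
#include <stdint.h>

/* Portable reference model of NEON SDOT Vd.4S, Vn.16B, Vm.16B.
   Each of the four INT32 lanes accumulates a 4-way INT8 dot product. */
static void sdot_ref(int32_t acc[4], const int8_t a[16], const int8_t b[16]) {
    for (int lane = 0; lane < 4; lane++)      /* four INT32 accumulators */
        for (int k = 0; k < 4; k++)           /* 4 MACs per lane         */
            acc[lane] += (int32_t)a[4 * lane + k] * (int32_t)b[4 * lane + k];
}
```

Feeding it a = {1, 2, …, 16} and b = all-ones reproduces the widening-accumulate behaviour lane by lane.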
04

Why NEON Needed a Successor

  • Fixed 128-bit width. Compiler has to pick: vectorise for 128-bit or widen the ABI and break compat. HPC wants 512-bit; mobile wants 128-bit; both want the same binary.
  • Tail handling is painful. Loop epilogues for non-multiple-of-4 counts dominate code size and branch footprint.
  • No predication. Conditional execution per-lane requires mask + blend sequences; expensive.
  • Limited gather/scatter. NEON can do TBL-based shuffles, but has no native indexed load/store like AVX2 / AVX-512 gathers.
  • These gaps made Arm uncompetitive in HPC and ML training, which is why SVE was designed in close collaboration with Fujitsu for the K computer's successor, Fugaku.

The width-agnostic trick

SVE code compiles once for all widths 128–2048 in 128-bit increments. The runtime vector length is queried via RDVL; loops use WHILELT predicates that auto-mask the last iteration.

Fujitsu A64FX

512-bit SVE, 48 compute cores per chip, HBM2, ~3.4 TFLOPS peak FP64 per chip. Powered Fugaku, #1 on Top500 in 2020-21. Proved Arm could do sustained HPC.

05

SVE — Scalable Vector Extension

  • 32 × Z-registers of implementation-defined length (128 – 2048 bit in 128-bit steps). The low 128 bits of each overlay its NEON V-reg.
  • 16 × P-registers (predicate), 1 bit per byte of vector — so up to 256 predicate bits.
  • VLA (Vector-Length-Agnostic) programming: loops written with WHILELT and INCP, no explicit loop count.
  • First-fault loads — a vector load that doesn't fault on later elements even if they would; enables strlen-style ops.
  • Gather / scatter: indexed load/store in one instruction.
  • Implementations: A64FX (512-bit), Graviton 3 (256-bit), NVIDIA Grace (128-bit per core), Microsoft Cobalt (128-bit). Phone SoCs so far: 128-bit in SVE2.
// Canonical SVE loop — daxpy y[i] += a*x[i]

  mov     x3, #0                // i = 0
  whilelt p0.d, x3, x4          // p0 = (i < n) mask
  b.none  2f

1:
  ld1d    z0.d, p0/z, [x0, x3, lsl #3]   // x[i..]
  ld1d    z1.d, p0/z, [x1, x3, lsl #3]   // y[i..]
  fmla    z1.d, p0/m, z0.d, z2.d         // y += a*x (masked)
  st1d    z1.d, p0,   [x1, x3, lsl #3]   // y[i..]
  incd    x3                              // i += VL/8
  whilelt p0.d, x3, x4
  b.first 1b                               // loop if any active
2:
  ret
06

Predication — Per-lane Masking

  • SVE operations take an optional governing predicate: p0/z (zeroing inactive) or p0/m (merging — leave dst unchanged).
  • Compare ops write a predicate: CMPGT P1.S, P0/Z, Z0.S, Z1.S.
  • WHILELT/WHILELE/WHILELO/WHILELS create loop-termination predicates (signed/unsigned, inclusive/exclusive). Key to VLA loops.
  • PTRUE / PFALSE set an all-true / all-false predicate.
  • Logical predicate ops (AND/ORR/EOR/BIC between predicates) let you combine masks.
  • INCP / DECP — increment X-register by active lane count. Replaces loop-counter math.

No separate "scalar tail"

In NEON, a loop processes 4 × FP32 per iteration and you need a post-loop to handle the 1-3 leftover elements. In SVE, the same loop's last iteration has fewer active lanes — no tail code at all.
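The tail-free loop structure can be modelled in portable C — a sketch in which the parameter vl stands in for the hardware vector length and the inner `if` plays the role of the WHILELT predicate:

```c
#include <stddef.h>

/* SVE-style predicated daxpy: the same loop body runs for any vector
   length `vl`; the final pass simply has fewer active lanes, so there
   is no scalar epilogue.  `vl` is an assumption of this model. */
static void daxpy_vla(double *y, const double *x, double a,
                      size_t n, size_t vl) {
    for (size_t i = 0; i < n; i += vl)            /* one "vector" per pass */
        for (size_t lane = 0; lane < vl; lane++)  /* predicate: i+lane < n */
            if (i + lane < n)
                y[i + lane] += a * x[i + lane];
}
```

Running it with vl = 4 and vl = 8 over the same n = 5 arrays gives identical results — the VLA property the slide describes.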

First-fault loads

LDFF1D — load that delivers as many elements as were legal; sets FFR (First Fault Register) to the mask of successful lanes. Enables a vectorised strlen that never reads past the zero byte.
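The first-fault mechanism can be sketched as a toy C model — `limit` simulates the first inaccessible address, the helper names are illustrative, and the model assumes a NUL terminator exists before `limit` (as real strings guarantee):

```c
#include <stddef.h>
#include <stdint.h>

/* Toy model of LDFF1B + FFR: load up to `vl` bytes (vl <= 64), stopping
   at a simulated inaccessible boundary `limit`; `ffr` gets the mask of
   lanes that succeeded. */
static size_t ldff1b_ref(uint8_t *dst, uint8_t *ffr, const char *p,
                         size_t vl, const char *limit) {
    size_t ok = 0;
    for (; ok < vl && p + ok < limit; ok++) { dst[ok] = (uint8_t)p[ok]; ffr[ok] = 1; }
    for (size_t i = ok; i < vl; i++) ffr[i] = 0;  /* faulting lanes: FFR=0 */
    return ok;
}

/* Vectorised strlen built on first-fault loads: never reads past `limit`. */
static size_t strlen_ff(const char *s, size_t vl, const char *limit) {
    uint8_t buf[64], ffr[64];
    size_t len = 0;
    for (;;) {
        size_t got = ldff1b_ref(buf, ffr, s + len, vl, limit);
        for (size_t i = 0; i < got; i++)
            if (buf[i] == 0) return len + i;      /* found the terminator */
        len += got;                               /* all loaded bytes nonzero */
    }
}
```

The real LDFF1* sequence re-reads FFR with RDFFR and retries partial loads; the model keeps only the "deliver as many lanes as were legal" behaviour.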

07

SVE2 — SVE for Everyone (Armv9-A)

  • Armv9-A mandates SVE2. First implementations: Cortex-X2, A710, A510 (all 128-bit).
  • Adds hundreds of instructions borrowed from NEON that were missing in SVE1:
    • Complex FP (CADD, CMLA)
    • Saturating integer, rounding shifts
    • Bitwise permutes & interleaves
    • String-match ops (MATCH, NMATCH) — serve memchr-, strchr- and strpbrk-style routines directly
    • Cryptography: SHA3, SM3/SM4, PMULL128
  • Bf16 + INT8 matmul instructions (BFMMLA, SMMLA, UMMLA) — the ML / LLM-inference workhorse.
  • Arm mandates SVE2 in Armv9-A. Expect NEON to stick around for 10+ years, but new kernel/toolchain energy goes into SVE2.
// SVE2 string match — compare 16 input bytes
// against 16 candidate bytes, lane-wise ANY-match

    ptrue   p0.b                     // all lanes active
    ld1b    z0.b, p0/z, [x0]         // input
    ld1b    z1.b, p0/z, [x1]         // candidates
    match   p1.b, p0/z, z0.b, z1.b   // p1 = any match

// BF16 matmul — 4x4 block of FP32 accumulator,
// packed BF16 A and B

    bfmmla  z0.s, z4.h, z8.h
// per 128-bit segment: C(2×2 FP32) += A(2×4 BF16) · B(2×4 BF16)ᵀ
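The per-segment BFMMLA semantics can be written out as a portable C model — a sketch modelling BF16 as the top 16 bits of an IEEE float (truncating convert); the helper names are illustrative:

```c
#include <stdint.h>
#include <string.h>

static float bf16_to_f32(uint16_t h) {
    uint32_t u = (uint32_t)h << 16; float f; memcpy(&f, &u, 4); return f;
}
static uint16_t f32_to_bf16(float f) {            /* truncating convert */
    uint32_t u; memcpy(&u, &f, 4); return (uint16_t)(u >> 16);
}

/* One 128-bit BFMMLA segment: C(2x2, FP32) += A(2x4, BF16) · B(2x4, BF16)^T,
   i.e. C[i][j] += dot(A row i, B row j). */
static void bfmmla_ref(float c[2][2], const uint16_t a[2][4],
                       const uint16_t b[2][4]) {
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++)
            for (int k = 0; k < 4; k++)
                c[i][j] += bf16_to_f32(a[i][k]) * bf16_to_f32(b[j][k]);
}
```

A full-width Z-register applies this independently to every 128-bit segment; the model shows one segment.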
08

SME — Scalable Matrix Extension (Armv9.2-A)

  • Adds 2-D tile storage: the ZA array — SVL/8 rows of SVL bits each (SVL = streaming vector length, implementation-defined power of two from 128 to 2048 bits).
  • ZA is viewed as tiles per element size (e.g. four 32-bit tiles ZA0.S – ZA3.S); horizontal and vertical slices (ZA0H.S[n], ZA0V.S[n]) are accessed as 1-D SVE vectors.
  • Core instruction: outer product into a ZA tile. Example: FMOPA — FP32 outer product accumulated into a .S tile.
  • Runs in a new "streaming SVE" mode — SVE ops operate at SVL, which may differ from the normal (non-streaming) vector length.
  • Target: GEMM workloads, LLM matmul, DSP convolutions — the same territory Apple AMX has been in since 2019.
  • SME2 (Armv9.4-A) expands with multi-vector instructions and a lookup-table register (ZT0).

Streaming mode

Regular (non-streaming) SVE uses the CPU's FP/SIMD unit. Streaming SVE borrows a dedicated hardware tile + outer-product engine. Switching modes is software-controlled via PSTATE.SM and SMSTART / SMSTOP.

Why matrices

Modern ML is dominated by matmul. A tile of 128×128 FP16 = 32 KB; a single FMOPA performs (SVL/32)² FP32 MACs — 256 MACs per instruction at SVL = 512 bits. Dramatically higher throughput than SVE vector ops.
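The outer-product step itself is simple to state in portable C — a sketch in which `nl` (lanes = SVL/32) stands in for the streaming vector length and the helper name is illustrative:

```c
/* Model of one FP32 FMOPA step: the ZA.S tile accumulates the outer
   product of two vectors, za[i][j] += x[i] * y[j].  Predicates (which
   can mask rows and columns independently) are omitted for clarity. */
static void fmopa_ref(float *za, const float *x, const float *y, int nl) {
    for (int i = 0; i < nl; i++)
        for (int j = 0; j < nl; j++)
            za[i * nl + j] += x[i] * y[j];
}
```

One instruction thus issues nl² MACs, versus nl MACs for an SVE FMLA — which is where the GEMM throughput advantage comes from.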

09

NEON vs SVE vs SME — Pick the Tool

Aspect | NEON | SVE / SVE2 | SME
Width | 128-bit fixed | 128 – 2048 bit, scalable | SVL × SVL tile
Predication | No (mask + blend) | Yes (16 P-regs) | Yes + 2-D
Tail handling | Scalar epilogue | WHILELT auto-mask | Tile slice masks
Gather/scatter | TBL only | Native | Multi-vector load
Best for | Media, codecs, crypto | HPC, BLAS, ML inference | GEMM / LLM
First shipped | Cortex-A8 (2009) | A64FX (2020), Cortex-X2 (2022) | Apple M4 (2024, SME2)
Mandatory? | v8.0-A | v9-A (SVE2) | v9.2-A optional

In practice: code targeting mobile + server should prefer SVE2 going forward. NEON intrinsics remain for legacy + codec libraries. SME is for GEMM-heavy kernels (LLM inference, cblas_sgemm).

10

Integer SIMD Dot-Product — the ML Fast-Path

  • UDOT / SDOT (Armv8.2-A) — each INT32 lane accumulates a 4-way INT8 × INT8 dot product; one 128-bit NEON SDOT = 4 lanes × 4 MACs = 16 MACs per instruction.
  • SMMLA / UMMLA / USMMLA (Armv8.6-A) — 2×8 times 8×2 matrix multiply into 2×2 INT32, in one instruction = 32 MACs.
  • Variants with bf16 (BFDOT/BFMMLA) target training-accuracy ML.
  • This is why XNNPACK / Arm Compute Library / llama.cpp's AArch64 path gets 2-3× over plain FP32.
  • The same instructions exist in SVE2 — scaled to the chosen vector length.
// INT8 GEMM micro-kernel (NEON)
// A: 4 x K, B: K x 16, C: 4 x 16 INT32

.loop:
  ld1  {v0.16b-v3.16b}, [x1], #64      // 4 rows of A
  ld1  {v4.16b-v7.16b}, [x2], #64      // 4 cols of B

  sdot v16.4s, v4.16b, v0.4b[0]        // C[0,:]
  sdot v17.4s, v4.16b, v0.4b[1]
  sdot v18.4s, v4.16b, v0.4b[2]
  sdot v19.4s, v4.16b, v0.4b[3]
  ...                                  // 16 sdots total
  subs w3, w3, #4
  bne  .loop

// 16 sdots × 16 MACs each = 256 MACs / iter
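The SMMLA variant mentioned above computes a small matrix product per 128-bit segment; a portable C model (helper name illustrative) makes the 32-MAC count concrete:

```c
#include <stdint.h>

/* One 128-bit SMMLA segment: C(2x2, INT32) += A(2x8, INT8) · B(2x8, INT8)^T
   — C[i][j] += dot(A row i, B row j); 2*2*8 = 32 MACs per instruction. */
static void smmla_ref(int32_t c[2][2], const int8_t a[2][8],
                      const int8_t b[2][8]) {
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++)
            for (int k = 0; k < 8; k++)
                c[i][j] += (int32_t)a[i][k] * (int32_t)b[j][k];
}
```

Compared with SDOT's independent lanes, the 2×2 output shape is what lets one instruction reuse each loaded operand row twice.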
11

Crypto Extensions

Instruction(s) | Feature | Purpose
AESE / AESD / AESMC / AESIMC | FEAT_AES | One-round AES encrypt / decrypt / MixColumns
PMULL / PMULL2 | FEAT_PMULL | Carry-less 64 × 64 → 128-bit multiply — GCM, CRC-32
SHA1C / SHA1P / SHA1M / SHA1H | FEAT_SHA1 | SHA-1 hash rounds (legacy)
SHA256H / SHA256H2 / SHA256SU0/1 | FEAT_SHA256 | SHA-256 rounds
SHA512H / SHA512H2 / SHA512SU0/1 | FEAT_SHA512 | SHA-512 rounds
SM3*, SM4* | FEAT_SM3 / FEAT_SM4 | Chinese national hash + cipher
EOR3 / RAX1 / XAR / BCAX | FEAT_SHA3 | Keccak / SHA-3 primitives

TLS / AES-GCM / file-system encryption all run at >10 GB/s on modern Cortex-A flagships because of these. A block cipher that would cost >20 cycles/byte in scalar code costs <0.5 cycles/byte with AES+PMULL.
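The carry-less multiply at the heart of AES-GCM is easy to model in portable C — a bit-serial sketch of what one PMULL computes in a single cycle-ish latency (helper name illustrative):

```c
#include <stdint.h>

/* Reference model of PMULL: polynomial multiplication over GF(2) of two
   64-bit operands into a 128-bit product {hi, lo} — shifts and XORs,
   no carries. */
static void pmull_ref(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo) {
    uint64_t h = 0, l = 0;
    for (int i = 0; i < 64; i++)
        if ((b >> i) & 1) {                   /* XOR-accumulate a << i */
            l ^= a << i;
            if (i) h ^= a >> (64 - i);        /* bits spilling into hi */
        }
    *hi = h; *lo = l;
}
```

GHASH in AES-GCM reduces such 128-bit products modulo the GCM polynomial; the hardware instruction replaces this 64-iteration loop, which is where the cycles/byte collapse comes from.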

12

Compiler Auto-vectorization Reality

  • NEON auto-vec — LLVM and GCC have strong auto-vec for simple loops. Gets ~60-80% of hand-tuned for straight-line arithmetic.
  • SVE auto-vec — GCC 10+ and LLVM 13+ produce decent SVE for HPC loops. VLA loops compile well.
  • Fails at: data-dependent branches, gather-heavy patterns, non-contiguous strides, histogram-like reductions.
  • For those you drop to:
    • NEON intrinsics (#include <arm_neon.h>) — type-safe wrappers the compiler can still schedule and register-allocate, unlike inline asm.
    • SVE ACLE intrinsics (<arm_sve.h>) — svint32_t, svbool_t, compile to VLA code.
    • Inline assembly — last resort.
// SVE ACLE intrinsics — daxpy (VLA)
#include <arm_sve.h>

void daxpy(double *y, const double *x,
           double a, size_t n) {
  svfloat64_t va = svdup_f64(a);
  size_t i = 0;
  svbool_t pg = svwhilelt_b64(i, n);
  while (svptest_first(svptrue_b64(), pg)) {
    svfloat64_t vx = svld1_f64(pg, x + i);
    svfloat64_t vy = svld1_f64(pg, y + i);
    vy = svmad_f64_m(pg, vx, va, vy);
    svst1_f64(pg, y + i, vy);
    i += svcntd();
    pg = svwhilelt_b64(i, n);
  }
}
13

OS & ABI Considerations

  • Context-switch cost: saving/restoring V0-V31 = 512 B. SVE at 2048-bit = 8 KB. SME ZA tile at SVL=256B = 64 KB.
  • Linux uses lazy FP context — first use in a new task triggers a trap; restoration on demand.
  • SME adds a whole new streaming-SVE mode with its own state bit (PSTATE.SM). OS must save/restore on both mode and process boundaries.
  • The Linux syscall ABI exits streaming mode (PSTATE.SM is cleared on syscall entry), and SVE register bits above the low 128 are unspecified across syscalls.
  • Android / Linux discovery via HWCAP and HWCAP2 bits in the auxiliary vector (getauxval / /proc/self/auxv): HWCAP_ASIMD, HWCAP_SVE, HWCAP2_SVE2, HWCAP2_SME.

Apple AMX vs SME

Apple has shipped AMX (an undocumented co-processor) since A13 / M1. SME is the Arm-architectural equivalent — and with the M4 (2024), Apple adopted SME2, exposing its matrix hardware through standard instructions.

Don't forget tagging

SME state interacts with MTE: SME store instructions must respect tag checks. SVE gather/scatter similarly — each lane's PA is tag-checked independently.

14

Lessons

  • "Why SVE over NEON?" → vector-length agnostic; one binary vectorises efficiently across phones (128-bit) and servers (256/512/2048-bit). Predication eliminates tail code.
  • "What's SDOT doing?" → 4× INT8 × INT8 → INT32 accumulate per lane; the ML / quantised-inference fast path.
  • "What's in SVE2 that SVE1 doesn't have?" → NEON-equivalent integer/saturating/match ops + bf16/INT8 matmul. SVE1 was HPC-only; SVE2 brings it to mobile.
  • "Why does SME exist if SVE can do matmul?" → SVE needs N instructions for an outer product; SME does it in one, with a dedicated tile register file. ~4-8× higher throughput on GEMM.
  • "Fixed vs scalable vector width?" → fixed requires fresh binaries per target; scalable lets one binary run on everything from A510 (128-bit) to Neoverse V2 (256-bit) to Fugaku (512-bit).
  • "What's the first-fault register?" → FFR in SVE — tracks which lanes of a LDFF1* load succeeded; enables safe vectorised strlen.
15

Further Reading

Arm documents

  • DDI 0487 — Arm ARM for A-profile — chapters C2/C3 (NEON), C4 (SVE/SVE2), C5 (SME)
  • Arm Compiler 6 Reference Guide — NEON + SVE ACLE intrinsics
  • Arm IHI 0055 — Procedure Call Standard for the Arm 64-bit Architecture (AAPCS64) — scalable-vector ABI section
  • Arm Learn the Architecture: Introducing SME — free blog + PDF series

Practical

  • Arm Compute Library (ComputeLibrary/arm_compute) — production CNN / GEMM kernels in NEON/SVE/SME
  • XNNPACK — Google's mobile-ML kernels; SDOT/UDOT paths on Cortex-A
  • llama.cpp — the ggml-cpu/arch/arm/ tree has NEON, SVE and SME kernels for LLM inference
  • OpenSSL / BoringSSL — AES-NI-style Arm crypto paths
16

References

Arm Ltd. — DDI 0487, Arm Architecture Reference Manual for A-profile — canonical spec for NEON, SVE, SVE2, SME
Arm Ltd. — Arm C Language Extensions (ACLE) — intrinsic reference
Arm Ltd. — SVE Programming Guide (Arm 100891), freely downloadable
Stephens, Biles, Boettcher et al. — "The ARM Scalable Vector Extension" (IEEE Micro, 2017)
Fujitsu Ltd. — A64FX Microarchitecture Manual — first SVE implementation reference
AnandTech — Andrei Frumusanu's reviews of Cortex-X2 / A710 SVE2 throughput (2021-22)
Dougall Johnson — dougallj.wordpress.com — Apple AMX reverse-engineering pieces (for SME context)
Chips and Cheese — deep dives on Cortex-X and Neoverse V-series FP/SIMD back-ends

Presentation built with Reveal.js 4.6 · Playfair Display + DM Sans + JetBrains Mono
Educational use.