ARM CORTEX-A · PRESENTATION 04

Vector Extensions — NEON, SVE, SVE2, SME

Two decades of Arm SIMD · Fixed-width → Scalable → Matrix
NEON · SVE · SVE2 · SME · Bfloat16 · INT8 dot-product · Crypto · Fujitsu A64FX
02

Evolution of Arm SIMD

Year | Extension | Width | Shipping in | Key idea
2005 | NEON (v1) | 128-bit fixed | Cortex-A8 | First full Arm SIMD; integer + SP float
2011 | NEON + full DP | 128-bit fixed | Armv8-A (A53/A57) | Unified with FPU; 32 × 128-bit V-regs
2016 | FP16 + dot-product (SDOT/UDOT) | 128-bit fixed | Armv8.2-A, A75+ | INT8 × INT8 → INT32 accumulate; ML
2017 | SVE (optional) | 128 – 2048 bit | Fujitsu A64FX only | Vector-Length Agnostic; HPC
2020 | bfloat16 / INT8 matmul | 128-bit fixed | Armv8.6-A | ML training dtypes in NEON
2021 | SVE2 (mandatory) | 128 – 2048 bit | Armv9-A (X2, A710, A510) | SVE for mobile; replaces NEON for new code
2023 | SME / SME2 | SVL × SVL tiles | Armv9.2-A | Matrix outer-product; streaming mode
03

NEON — Advanced SIMD

  • 32 × 128-bit V-registers in AArch64 (V0–V31). In AArch32 they overlay as 32 × D-regs or 16 × Q-regs.
  • Each V-reg can be addressed as:
    • 16 × B (byte), 8 × H (halfword), 4 × S (word), 2 × D (doubleword)
    • 4 × FP32, 8 × FP16 (if FP16 feature), 2 × FP64
  • Lane-wise arithmetic: ADD V0.4S, V1.4S, V2.4S.
  • Saturating arithmetic, pairwise ops (ADDP, UADALP), cross-lane shuffles (EXT, UZP, ZIP, TRN, TBL).
  • Widening / narrowing pairs: UADDL (add, widen), ADDHN (add high narrow).
  • Crypto extension piggybacks on V-regs: AESE/AESD/AESMC, SHA1H, SHA256H, PMULL (carry-less 64 × 64 → 128-bit multiply).
// memcpy tail using NEON 128-bit loads
// (glibc-style AArch64 memcpy)

ldp   q0, q1, [x1, #-32]    // last 32 bytes
ldr   q2, [x1, #-48]        // prior 16
stp   q0, q1, [x0, #-32]
str   q2, [x0, #-48]

// SDOT (INT8 × INT8 → INT32 accumulate)
sdot  v3.4s, v4.16b, v5.16b
// per-lane: v3[i] += sum(v4[4i..4i+3] * v5[4i..4i+3])

// 4x4 matmul kernel (FP32) — one column of C (v0) per group of four FMLAs
fmla  v0.4s, v4.4s, v8.s[0]
fmla  v0.4s, v5.4s, v8.s[1]
fmla  v0.4s, v6.4s, v8.s[2]
fmla  v0.4s, v7.4s, v8.s[3]
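The per-lane SDOT comment above can be checked against a portable C model — a sketch, with the helper name sdot_ref being illustrative, not an Arm API:

```c
#include <stdint.h>

/* Portable reference model of NEON SDOT Vd.4S, Vn.16B, Vm.16B.
   Each of the four INT32 lanes accumulates a 4-way INT8 dot product. */
static void sdot_ref(int32_t acc[4], const int8_t a[16], const int8_t b[16]) {
    for (int lane = 0; lane < 4; lane++)      /* four INT32 accumulators */
        for (int k = 0; k < 4; k++)           /* 4 MACs per lane         */
            acc[lane] += (int32_t)a[4 * lane + k] * (int32_t)b[4 * lane + k];
}
```

Feeding it a = {1, 2, …, 16} and b = all-ones reproduces the widening-accumulate behaviour lane by lane.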
04

Why NEON Needed a Successor

  • Fixed 128-bit width. Compiler has to pick: vectorise for 128-bit or widen the ABI and break compat. HPC wants 512-bit; mobile wants 128-bit; both want the same binary.
  • Tail handling is painful. Loop epilogues for non-multiple-of-4 counts dominate code size and branch footprint.
  • No predication. Conditional execution per-lane requires mask + blend sequences; expensive.
  • Limited gather/scatter. NEON can do TBL-based shuffles, but has no native indexed load/store like AVX2 / AVX-512 gathers.
  • These gaps made Arm uncompetitive in HPC and ML training, which is why SVE was designed in close collaboration with Fujitsu for the K computer's successor, Fugaku.

The width-agnostic trick

SVE code compiles once for all widths 128–2048 in 128-bit increments. The runtime vector length is queried via RDVL; loops use WHILELT predicates that auto-mask the last iteration.

Fujitsu A64FX

512-bit SVE, 48 compute cores per chip, HBM2, ~3.4 TFLOPS peak FP64 per chip. Powered Fugaku, #1 on Top500 in 2020-21. Proved Arm could do sustained HPC.

05

SVE — Scalable Vector Extension

  • 32 × Z-registers of implementation-defined length (128 – 2048 bit in 128-bit steps). The low 128 bits of each overlay its NEON V-reg.
  • 16 × P-registers (predicate), 1 bit per byte of vector — so up to 256 predicate bits.
  • VLA (Vector-Length-Agnostic) programming: loops written with WHILELT and INCP, no explicit loop count.
  • First-fault loads — a vector load that doesn't fault on later elements even if they would; enables strlen-style ops.
  • Gather / scatter: indexed load/store in one instruction.
  • Implementations: A64FX (512-bit), Graviton 3 (256-bit), NVIDIA Grace (128-bit per core), Microsoft Cobalt (128-bit). Phone SoCs so far: 128-bit in SVE2.
// Canonical SVE loop — daxpy y[i] += a*x[i]

  mov     x3, #0                // i = 0
  whilelt p0.d, x3, x4          // p0 = (i < n) mask
  b.none  2f

1:
  ld1d    z0.d, p0/z, [x0, x3, lsl #3]   // x[i..]
  ld1d    z1.d, p0/z, [x1, x3, lsl #3]   // y[i..]
  fmla    z1.d, p0/m, z0.d, z2.d         // y += a*x (masked)
  st1d    z1.d, p0,   [x1, x3, lsl #3]   // y[i..]
  incd    x3                              // i += VL/8
  whilelt p0.d, x3, x4
  b.first 1b                               // loop if any active
2:
  ret
06

Predication — Per-lane Masking

  • SVE operations take an optional governing predicate: p0/z (zeroing inactive) or p0/m (merging — leave dst unchanged).
  • Compare ops write a predicate: CMPGT P1.S, P0/Z, Z0.S, Z1.S.
  • WHILELT/WHILELE/WHILELO/WHILELS create loop-termination predicates (signed/unsigned, inclusive/exclusive). Key to VLA loops.
  • PTRUE / PFALSE set an all-true / all-false predicate.
  • Logical predicate ops (AND/ORR/EOR/BIC between predicates) let you combine masks.
  • INCP / DECP — increment X-register by active lane count. Replaces loop-counter math.

No separate "scalar tail"

In NEON, a loop processes 4 × FP32 per iteration and you need a post-loop to handle the 1-3 leftover elements. In SVE, the same loop's last iteration has fewer active lanes — no tail code at all.
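The tail-free loop structure can be modelled in portable C — a sketch in which the parameter vl stands in for the hardware vector length and the inner `if` plays the role of the WHILELT predicate:

```c
#include <stddef.h>

/* SVE-style predicated daxpy: the same loop body runs for any vector
   length `vl`; the final pass simply has fewer active lanes, so there
   is no scalar epilogue.  `vl` is an assumption of this model. */
static void daxpy_vla(double *y, const double *x, double a,
                      size_t n, size_t vl) {
    for (size_t i = 0; i < n; i += vl)            /* one "vector" per pass */
        for (size_t lane = 0; lane < vl; lane++)  /* predicate: i+lane < n */
            if (i + lane < n)
                y[i + lane] += a * x[i + lane];
}
```

Running it with vl = 4 and vl = 8 over the same n = 5 arrays gives identical results — the VLA property the slide describes.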

First-fault loads

LDFF1D — load that delivers as many elements as were legal; sets FFR (First Fault Register) to the mask of successful lanes. Enables a vectorised strlen that never reads past the zero byte.
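The first-fault mechanism can be sketched as a toy C model — `limit` simulates the first inaccessible address, the helper names are illustrative, and the model assumes a NUL terminator exists before `limit` (as real strings guarantee):

```c
#include <stddef.h>
#include <stdint.h>

/* Toy model of LDFF1B + FFR: load up to `vl` bytes (vl <= 64), stopping
   at a simulated inaccessible boundary `limit`; `ffr` gets the mask of
   lanes that succeeded. */
static size_t ldff1b_ref(uint8_t *dst, uint8_t *ffr, const char *p,
                         size_t vl, const char *limit) {
    size_t ok = 0;
    for (; ok < vl && p + ok < limit; ok++) { dst[ok] = (uint8_t)p[ok]; ffr[ok] = 1; }
    for (size_t i = ok; i < vl; i++) ffr[i] = 0;  /* faulting lanes: FFR=0 */
    return ok;
}

/* Vectorised strlen built on first-fault loads: never reads past `limit`. */
static size_t strlen_ff(const char *s, size_t vl, const char *limit) {
    uint8_t buf[64], ffr[64];
    size_t len = 0;
    for (;;) {
        size_t got = ldff1b_ref(buf, ffr, s + len, vl, limit);
        for (size_t i = 0; i < got; i++)
            if (buf[i] == 0) return len + i;      /* found the terminator */
        len += got;                               /* all loaded bytes nonzero */
    }
}
```

The real LDFF1* sequence re-reads FFR with RDFFR and retries partial loads; the model keeps only the "deliver as many lanes as were legal" behaviour.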

07

SVE2 — SVE for Everyone (Armv9-A)

  • Armv9-A mandates SVE2. First implementations: Cortex-X2, A710, A510 (all 128-bit).
  • Adds hundreds of instructions borrowed from NEON that were missing in SVE1:
    • Complex FP (CADD, CMLA)
    • Saturating integer, rounding shifts
    • Bitwise permutes & interleaves
    • String-match ops (MATCH, NMATCH) — serve memchr-, strchr- and strpbrk-style routines directly
    • Cryptography: SHA3, SM3/SM4, PMULL128
  • Bf16 + INT8 matmul instructions (BFMMLA, SMMLA, UMMLA) — the ML / LLM-inference workhorse.
  • Arm mandates SVE2 in Armv9-A. Expect NEON to stick around for 10+ years, but new kernel/toolchain energy goes into SVE2.
// SVE2 string match — compare 16 input bytes
// against 16 candidate bytes, lane-wise ANY-match

    ptrue   p0.b                     // all lanes active
    ld1b    z0.b, p0/z, [x0]         // input
    ld1b    z1.b, p0/z, [x1]         // candidates
    match   p1.b, p0/z, z0.b, z1.b   // p1 = any match

// BF16 matmul — 4x4 block of FP32 accumulator,
// packed BF16 A and B

    bfmmla  z0.s, z4.h, z8.h
// per 128-bit segment: C(2×2 FP32) += A(2×4 BF16) · B(2×4 BF16)ᵀ
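The per-segment BFMMLA semantics can be written out as a portable C model — a sketch modelling BF16 as the top 16 bits of an IEEE float (truncating convert); the helper names are illustrative:

```c
#include <stdint.h>
#include <string.h>

static float bf16_to_f32(uint16_t h) {
    uint32_t u = (uint32_t)h << 16; float f; memcpy(&f, &u, 4); return f;
}
static uint16_t f32_to_bf16(float f) {            /* truncating convert */
    uint32_t u; memcpy(&u, &f, 4); return (uint16_t)(u >> 16);
}

/* One 128-bit BFMMLA segment: C(2x2, FP32) += A(2x4, BF16) · B(2x4, BF16)^T,
   i.e. C[i][j] += dot(A row i, B row j). */
static void bfmmla_ref(float c[2][2], const uint16_t a[2][4],
                       const uint16_t b[2][4]) {
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++)
            for (int k = 0; k < 4; k++)
                c[i][j] += bf16_to_f32(a[i][k]) * bf16_to_f32(b[j][k]);
}
```

A full-width Z-register applies this independently to every 128-bit segment; the model shows one segment.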
08

SME — Scalable Matrix Extension (Armv9.2-A)

  • Adds 2-D tile storage: the ZA array — SVL/8 rows of SVL bits each (SVL = streaming vector length, implementation-defined power of two from 128 to 2048 bits).
  • ZA is viewed as tiles per element size (e.g. four 32-bit tiles ZA0.S – ZA3.S); horizontal and vertical slices (ZA0H.S[n], ZA0V.S[n]) are accessed as 1-D SVE vectors.
  • Core instruction: outer product into a ZA tile. Example: FMOPA — FP32 outer product accumulated into a .S tile.
  • Runs in a new "streaming SVE" mode — SVE ops operate at SVL, which may differ from the normal (non-streaming) vector length.
  • Target: GEMM workloads, LLM matmul, DSP convolutions — the same territory Apple AMX has been in since 2019.
  • SME2 (Armv9.4-A) expands with multi-vector instructions and a lookup-table register (ZT0).

Streaming mode

Regular (non-streaming) SVE uses the CPU's FP/SIMD unit. Streaming SVE borrows a dedicated hardware tile + outer-product engine. Switching modes is software-controlled via PSTATE.SM and SMSTART / SMSTOP.

Why matrices

Modern ML is dominated by matmul. A tile of 128×128 FP16 = 32 KB; a single FMOPA performs (SVL/32)² FP32 MACs — 256 MACs per instruction at SVL = 512 bits. Dramatically higher throughput than SVE vector ops.
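The outer-product step itself is simple to state in portable C — a sketch in which `nl` (lanes = SVL/32) stands in for the streaming vector length and the helper name is illustrative:

```c
/* Model of one FP32 FMOPA step: the ZA.S tile accumulates the outer
   product of two vectors, za[i][j] += x[i] * y[j].  Predicates (which
   can mask rows and columns independently) are omitted for clarity. */
static void fmopa_ref(float *za, const float *x, const float *y, int nl) {
    for (int i = 0; i < nl; i++)
        for (int j = 0; j < nl; j++)
            za[i * nl + j] += x[i] * y[j];
}
```

One instruction thus issues nl² MACs, versus nl MACs for an SVE FMLA — which is where the GEMM throughput advantage comes from.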

09

NEON vs SVE vs SME — Pick the Tool

Aspect | NEON | SVE / SVE2 | SME
Width | 128-bit fixed | 128 – 2048 bit, scalable | SVL × SVL tile
Predication | No (mask + blend) | Yes (16 P-regs) | Yes + 2-D
Tail handling | Scalar epilogue | WHILELT auto-mask | Tile slice masks
Gather/scatter | TBL only | Native | Multi-vector load
Best for | Media, codecs, crypto | HPC, BLAS, ML inference | GEMM / LLM
First shipped | Cortex-A8 (2009) | A64FX (2020), Cortex-X2 (2022) | Apple M4 (2024, SME2)
Mandatory? | v8.0-A | v9-A (SVE2) | v9.2-A optional

In practice: code targeting mobile + server should prefer SVE2 going forward. NEON intrinsics remain for legacy + codec libraries. SME is for GEMM-heavy kernels (LLM inference, cblas_sgemm).

10

Integer SIMD Dot-Product — the ML Fast-Path

  • UDOT / SDOT (Armv8.2-A) — each INT32 lane accumulates a 4-way INT8 × INT8 dot product; one 128-bit NEON SDOT = 4 lanes × 4 MACs = 16 MACs per instruction.
  • SMMLA / UMMLA / USMMLA (Armv8.6-A) — 2×8 times 8×2 matrix multiply into 2×2 INT32, in one instruction = 32 MACs.
  • Variants with bf16 (BFDOT/BFMMLA) target training-accuracy ML.
  • This is why XNNPACK / Arm Compute Library / llama.cpp's AArch64 path gets 2-3× over plain FP32.
  • The same instructions exist in SVE2 — scaled to the chosen vector length.
// INT8 GEMM micro-kernel (NEON)
// A: 4 x K, B: K x 16, C: 4 x 16 INT32

.loop:
  ld1  {v0.16b-v3.16b}, [x1], #64      // 4 rows of A
  ld1  {v4.16b-v7.16b}, [x2], #64      // 4 cols of B

  sdot v16.4s, v4.16b, v0.4b[0]        // C[0,:]
  sdot v17.4s, v4.16b, v0.4b[1]
  sdot v18.4s, v4.16b, v0.4b[2]
  sdot v19.4s, v4.16b, v0.4b[3]
  ...                                  // 16 sdots total
  subs w3, w3, #4
  bne  .loop

// 16 sdots × 16 MACs each = 256 MACs / iter
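The SMMLA variant mentioned above computes a small matrix product per 128-bit segment; a portable C model (helper name illustrative) makes the 32-MAC count concrete:

```c
#include <stdint.h>

/* One 128-bit SMMLA segment: C(2x2, INT32) += A(2x8, INT8) · B(2x8, INT8)^T
   — C[i][j] += dot(A row i, B row j); 2*2*8 = 32 MACs per instruction. */
static void smmla_ref(int32_t c[2][2], const int8_t a[2][8],
                      const int8_t b[2][8]) {
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++)
            for (int k = 0; k < 8; k++)
                c[i][j] += (int32_t)a[i][k] * (int32_t)b[j][k];
}
```

Compared with SDOT's independent lanes, the 2×2 output shape is what lets one instruction reuse each loaded operand row twice.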
11

Crypto Extensions

Instruction(s) | Feature | Purpose
AESE / AESD / AESMC / AESIMC | FEAT_AES | One-round AES encrypt / decrypt / MixColumns
PMULL / PMULL2 | FEAT_PMULL | Carry-less 64 × 64 → 128-bit multiply — GCM, CRC-32
SHA1C / SHA1P / SHA1M / SHA1H | FEAT_SHA1 | SHA-1 hash rounds (legacy)
SHA256H / SHA256H2 / SHA256SU0/1 | FEAT_SHA256 | SHA-256 rounds
SHA512H / SHA512H2 / SHA512SU0/1 | FEAT_SHA512 | SHA-512 rounds
SM3*, SM4* | FEAT_SM3 / FEAT_SM4 | Chinese national hash + cipher
EOR3 / RAX1 / XAR / BCAX | FEAT_SHA3 | Keccak / SHA-3 primitives

TLS / AES-GCM / file-system encryption all run at >10 GB/s on modern Cortex-A flagships because of these. A block cipher that would cost >20 cycles/byte in scalar code costs <0.5 cycles/byte with AES+PMULL.
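The carry-less multiply at the heart of AES-GCM is easy to model in portable C — a bit-serial sketch of what one PMULL computes in a single cycle-ish latency (helper name illustrative):

```c
#include <stdint.h>

/* Reference model of PMULL: polynomial multiplication over GF(2) of two
   64-bit operands into a 128-bit product {hi, lo} — shifts and XORs,
   no carries. */
static void pmull_ref(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo) {
    uint64_t h = 0, l = 0;
    for (int i = 0; i < 64; i++)
        if ((b >> i) & 1) {                   /* XOR-accumulate a << i */
            l ^= a << i;
            if (i) h ^= a >> (64 - i);        /* bits spilling into hi */
        }
    *hi = h; *lo = l;
}
```

GHASH in AES-GCM reduces such 128-bit products modulo the GCM polynomial; the hardware instruction replaces this 64-iteration loop, which is where the cycles/byte collapse comes from.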

12

Compiler Auto-vectorization Reality

  • NEON auto-vec — LLVM and GCC have strong auto-vec for simple loops. Gets ~60-80% of hand-tuned for straight-line arithmetic.
  • SVE auto-vec — GCC 10+ and LLVM 13+ produce decent SVE for HPC loops. VLA loops compile well.
  • Fails at: data-dependent branches, gather-heavy patterns, non-contiguous strides, histogram-like reductions.
  • For those you drop to:
    • NEON intrinsics (#include <arm_neon.h>) — type-safe wrappers the compiler can still schedule and register-allocate, unlike inline asm.
    • SVE ACLE intrinsics (<arm_sve.h>) — svint32_t, svbool_t, compile to VLA code.
    • Inline assembly — last resort.
// SVE ACLE intrinsics — daxpy (VLA)
#include <arm_sve.h>

void daxpy(double *y, const double *x,
           double a, size_t n) {
  svfloat64_t va = svdup_f64(a);
  size_t i = 0;
  svbool_t pg = svwhilelt_b64(i, n);
  while (svptest_first(svptrue_b64(), pg)) {
    svfloat64_t vx = svld1_f64(pg, x + i);
    svfloat64_t vy = svld1_f64(pg, y + i);
    vy = svmad_f64_m(pg, vx, va, vy);
    svst1_f64(pg, y + i, vy);
    i += svcntd();
    pg = svwhilelt_b64(i, n);
  }
}
13

OS & ABI Considerations

  • Context-switch cost: saving/restoring V0-V31 = 512 B. SVE at 2048-bit = 8 KB. SME ZA tile at SVL=256B = 64 KB.
  • Linux uses lazy FP context — first use in a new task triggers a trap; restoration on demand.
  • SME adds a whole new streaming-SVE mode with its own state bit (PSTATE.SM). OS must save/restore on both mode and process boundaries.
  • The Linux syscall ABI exits streaming mode (PSTATE.SM is cleared on syscall entry), and SVE register bits above the low 128 are unspecified across syscalls.
  • Android / Linux discovery via HWCAP and HWCAP2 bits in the auxiliary vector (getauxval / /proc/self/auxv): HWCAP_ASIMD, HWCAP_SVE, HWCAP2_SVE2, HWCAP2_SME.

Apple AMX vs SME

Apple has shipped AMX (an undocumented co-processor) since A13 / M1. SME is the Arm-architectural equivalent — and with the M4 (2024), Apple adopted SME2, exposing its matrix hardware through standard instructions.

Don't forget tagging

SME state interacts with MTE: SME store instructions must respect tag checks. SVE gather/scatter similarly — each lane's PA is tag-checked independently.

14

Lessons

  • "Why SVE over NEON?" → vector-length agnostic; one binary vectorises efficiently across phones (128-bit) and servers (256/512/2048-bit). Predication eliminates tail code.
  • "What's SDOT doing?" → 4× INT8 × INT8 → INT32 accumulate per lane; the ML / quantised-inference fast path.
  • "What's in SVE2 that SVE1 doesn't have?" → NEON-equivalent integer/saturating/match ops + bf16/INT8 matmul. SVE1 was HPC-only; SVE2 brings it to mobile.
  • "Why does SME exist if SVE can do matmul?" → SVE needs N instructions for an outer product; SME does it in one, with a dedicated tile register file. ~4-8× higher throughput on GEMM.
  • "Fixed vs scalable vector width?" → fixed requires fresh binaries per target; scalable lets one binary run on everything from A510 (128-bit) to Neoverse V2 (256-bit) to Fugaku (512-bit).
  • "What's the first-fault register?" → FFR in SVE — tracks which lanes of a LDFF1* load succeeded; enables safe vectorised strlen.
15

Further Reading

Arm documents

  • DDI 0487 — Arm ARM for A-profile — chapters C2/C3 (NEON), C4 (SVE/SVE2), C5 (SME)
  • Arm Compiler 6 Reference Guide — NEON + SVE ACLE intrinsics
  • Arm IHI 0055 — Procedure Call Standard for the Arm 64-bit Architecture (AAPCS64) — scalable-vector ABI section
  • Arm Learn the Architecture: Introducing SME — free blog + PDF series

Practical

  • Arm Compute Library (ComputeLibrary/arm_compute) — production CNN / GEMM kernels in NEON/SVE/SME
  • XNNPACK — Google's mobile-ML kernels; SDOT/UDOT paths on Cortex-A
  • llama.cpp — the ggml-cpu/arch/arm/ tree has NEON, SVE and SME kernels for LLM inference
  • OpenSSL / BoringSSL — AES-NI-style Arm crypto paths
16

References

Arm Ltd. — DDI 0487, Arm Architecture Reference Manual for A-profile — canonical spec for NEON, SVE, SVE2, SME
Arm Ltd. — Arm C Language Extensions (ACLE) — intrinsic reference
Arm Ltd. — SVE Programming Guide (Arm 100891), freely downloadable
Stephens, Biles, Boettcher et al. — "The ARM Scalable Vector Extension" (IEEE Micro, 2017)
Fujitsu Ltd. — A64FX Microarchitecture Manual — first SVE implementation reference
AnandTech — Andrei Frumusanu's reviews of Cortex-X2 / A710 SVE2 throughput (2021-22)
Dougall Johnson — dougallj.wordpress.com — Apple AMX reverse-engineering pieces (for SME context)
Chips and Cheese — deep dives on Cortex-X and Neoverse V-series FP/SIMD back-ends

Presentation built with Reveal.js 4.6 · Playfair Display + DM Sans + JetBrains Mono
Educational use.