| Year | Extension | Width | Shipping in | Key idea |
|---|---|---|---|---|
| 2005 | NEON (v1) | 128-bit fixed | Cortex-A8 | First Arm Advanced SIMD; integer + SP float |
| 2011 | NEON + full DP | 128-bit fixed | Armv8-A (A53/A57) | Unified with FPU; 32 × 128-bit V-regs |
| 2016 | FP16 + dot-product (SDOT/UDOT) | 128-bit fixed | Armv8.2-A, A75+ | INT8 × INT8 → INT32 accumulate; ML |
| 2017 | SVE (optional) | 128 – 2048 bit | Fujitsu A64FX only | Vector-Length Agnostic; HPC |
| 2020 | bfloat16 / matmul-INT8 | 128-bit fixed | Armv8.6-A | ML training dtypes in NEON |
| 2021 | SVE2 (mandatory) | 128 – 2048 bit | Armv9-A (X2, A710, A510) | SVE for mobile; replaces NEON for new code |
| 2023 | SME / SME2 | SVL × SVL tiles | Armv9.2-A | Matrix outer-product; streaming mode |
// memcpy tail using NEON 128-bit loads
// (glibc-style AArch64 memcpy; x0/x1 point one past the end)
ldp q0, q1, [x1, #-32] // last 32 bytes
ldr q2, [x1, #-48] // prior 16
stp q0, q1, [x0, #-32]
str q2, [x0, #-48]
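The trick above is overlapping copies: fixed-width loads/stores cover the tail even when they re-copy bytes the main loop already handled. A portable C sketch of the same idea — `copy_tail48` is a hypothetical helper name, and it assumes `len >= 48`:

```c
#include <string.h>
#include <stdint.h>
#include <assert.h>

/* Hypothetical helper: copy the last 48 bytes of src to dst
 * with three fixed 16-byte chunks, mirroring the ldp/ldr
 * q-register sequence. Chunks may overlap bytes the main loop
 * already copied -- that overlap is the point of the trick. */
static void copy_tail48(uint8_t *dst, const uint8_t *src, size_t len)
{
    memcpy(dst + len - 32, src + len - 32, 16); /* q0 */
    memcpy(dst + len - 16, src + len - 16, 16); /* q1 */
    memcpy(dst + len - 48, src + len - 48, 16); /* q2 */
}
```

Because every chunk is a fixed 16 bytes, no byte-granularity tail loop is ever needed.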
// SDOT (INT8 × INT8 → INT32 accumulate)
sdot v3.4s, v4.16b, v5.16b
// per-lane: v3[i] += sum(v4[4i..4i+3] * v5[4i..4i+3])
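The per-lane comment can be written out as a scalar model — a sketch of SDOT's semantics, not an implementation:

```c
#include <stdint.h>
#include <assert.h>

/* Scalar model of SDOT vd.4s, vn.16b, vm.16b: each of the four
 * 32-bit lanes accumulates a 4-way INT8 x INT8 dot product. */
static void sdot_model(int32_t acc[4], const int8_t a[16], const int8_t b[16])
{
    for (int i = 0; i < 4; i++)            /* 32-bit output lane */
        for (int j = 0; j < 4; j++)        /* 4 bytes per lane   */
            acc[i] += (int32_t)a[4*i + j] * (int32_t)b[4*i + j];
}
```

One instruction thus performs 16 multiply-accumulates, which is why SDOT/UDOT matter so much for quantised ML inference.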
// 4x4 matmul kernel (FP32): one output column, c += A * b,
// via lane-broadcast FMLAs (A columns in v4-v7, b in v8)
fmla v0.4s, v4.4s, v8.s[0]
fmla v0.4s, v5.4s, v8.s[1]
fmla v0.4s, v6.4s, v8.s[2]
fmla v0.4s, v7.4s, v8.s[3]
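The four lane-broadcast FMLAs compute one column of the product. A scalar C model of what they do (A stored column-wise, as it sits in v4–v7):

```c
#include <assert.h>

/* Scalar model of the four FMLA-by-lane instructions: one column
 * of a 4x4 FP32 matmul, c += A * b, with A[k] holding column k
 * (as in v4..v7) and b[k] the broadcast scalar (v8.s[k]). */
static void matmul_col4(float c[4], const float A[4][4], const float b[4])
{
    for (int k = 0; k < 4; k++)        /* one fmla per k */
        for (int i = 0; i < 4; i++)    /* 4 lanes in parallel */
            c[i] += A[k][i] * b[k];
}
```

Repeating this with four b-columns (and four accumulator registers) yields the full 4x4 block.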
SVE code compiles once for all widths 128–2048 in 128-bit increments. The runtime vector length is queried via RDVL; loops use WHILELT predicates that auto-mask the last iteration.
512-bit SVE, 48 compute cores, HBM2, ~2.7 TFLOP/s FP64 per chip. Powered Fugaku, #1 on the Top500 in 2020-21. Proved Arm could do sustained HPC.
// Canonical SVE loop — daxpy y[i] += a*x[i]
// x0 = x, x1 = y, x4 = n, z2.d = dup(a)
mov x3, #0 // i = 0
whilelt p0.d, x3, x4 // p0 = (i < n) mask
b.none 2f
1:
ld1d z0.d, p0/z, [x0, x3, lsl #3] // x[i..]
ld1d z1.d, p0/z, [x1, x3, lsl #3] // y[i..]
fmla z1.d, p0/m, z0.d, z2.d // y += a*x (masked)
st1d z1.d, p0, [x1, x3, lsl #3] // y[i..]
incd x3 // i += number of 64-bit lanes
whilelt p0.d, x3, x4
b.first 1b // loop if any active
2:
ret
In NEON, a loop processes 4 × FP32 per iteration, so you need a scalar post-loop for the 1–3 leftover elements. In SVE, the same loop's last iteration simply has fewer active lanes — no tail code at all.
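The difference can be sketched in portable C — a scalar model of WHILELT-style masking (the `VL` constant and helper name are mine, not from the source):

```c
#include <stddef.h>
#include <assert.h>

#define VL 4  /* lanes per "vector" in this scalar model */

/* Model of predicate-masked tail handling: every iteration,
 * including the last, runs the same body; lanes where i+j >= n
 * are simply inactive. No separate scalar epilogue exists. */
static void saxpy_masked(float *y, const float *x, float a, size_t n)
{
    for (size_t i = 0; i < n; i += VL)       /* one "vector" iter */
        for (size_t j = 0; j < VL; j++)      /* lanes */
            if (i + j < n)                   /* whilelt predicate */
                y[i + j] += a * x[i + j];
}
```

A NEON version of the same loop would need a second, scalar loop after the vector one; here the `i + j < n` check plays the role of the predicate register.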
LDFF1D — a first-faulting load: the first element faults normally, but later lanes that would fault are suppressed, and the FFR (First-Fault Register) records the mask of lanes actually loaded. Enables a vectorised strlen that can read ahead of the terminator without ever faulting on an unmapped page.
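A scalar model of the loop structure (the real SVE version uses LDFF1B + RDFFR so lanes past an unmapped page are masked out; this portable sketch simply assumes the whole string is readable, and `strlen_chunked` is a name of my choosing):

```c
#include <stddef.h>
#include <assert.h>

#define VL 8  /* bytes per "vector" in this scalar model */

/* Behavioral model of a first-faulting strlen loop: each step
 * processes up to VL bytes, scanning the "loaded" lanes for the
 * terminator before moving to the next chunk. */
static size_t strlen_chunked(const char *s)
{
    size_t i = 0;
    for (;;)                                   /* one chunk per pass */
        for (size_t j = 0; j < VL; j++, i++)   /* scan chunk lanes  */
            if (s[i] == '\0')
                return i;
}
```

In hardware, the chunk scan (`CMPEQ` against zero) runs on all lanes at once, and the FFR guarantees the load itself can never trap mid-string.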
// SVE2 string match — compare each input byte against
// a set of 16 candidate bytes in its 128-bit segment
ptrue p0.b // all lanes active
ld1b z0.b, p0/z, [x0] // input
ld1b z1.b, p0/z, [x1] // candidates
match p1.b, p0/z, z0.b, z1.b // p1[i]: z0[i] in candidate set
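MATCH is set-membership per byte, which a scalar model makes explicit — a sketch for one 128-bit segment:

```c
#include <stdint.h>
#include <stdbool.h>
#include <assert.h>

/* Scalar model of SVE2 MATCH for one 128-bit segment:
 * result[i] is set iff input byte i equals ANY of the 16
 * candidate bytes. */
static void match_model(bool result[16], const uint8_t in[16],
                        const uint8_t cand[16])
{
    for (int i = 0; i < 16; i++) {
        result[i] = false;
        for (int j = 0; j < 16; j++)
            if (in[i] == cand[j])
                result[i] = true;
    }
}
```

One instruction replaces 16 × 16 scalar comparisons — the core of vectorised `strpbrk`-style scanning.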
// BF16 matmul — per 128-bit segment, a 2x2 FP32
// block accumulates (2x4 BF16) × (4x2 BF16)
bfmmla z0.s, z4.h, z8.h
// each FP32 result element is a 4-element BF16 dot product
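A scalar model of one 128-bit BFMMLA segment — the helper names are mine, and the B operand is supplied row-major as 2x4 and used transposed, which is my reading of the instruction's operand packing:

```c
#include <stdint.h>
#include <string.h>
#include <assert.h>

/* BF16 is the top 16 bits of an IEEE FP32. */
static float bf16_to_f32(uint16_t h)
{
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Scalar model of BFMMLA for one 128-bit segment:
 * C (2x2 FP32) += A (2x4 BF16) x B^T, with B given as 2x4. */
static void bfmmla_model(float c[2][2], const uint16_t a[2][4],
                         const uint16_t b[2][4])
{
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++)
            for (int k = 0; k < 4; k++)
                c[i][j] += bf16_to_f32(a[i][k]) * bf16_to_f32(b[j][k]);
}
```

With a scalable vector, each 128-bit segment performs an independent 2x2 block, so wider SVL means more blocks per instruction.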
Regular (non-streaming) SVE uses the CPU's FP/SIMD unit. Streaming SVE borrows a dedicated hardware tile + outer-product engine. Switching modes is software-controlled via PSTATE.SM and SMSTART / SMSTOP.
Modern ML is dominated by matmul. A single SME FMOPA outer-product instruction updates an entire ZA tile: (SVL/32)² FP32 MACs per instruction, e.g. 16 × 16 = 256 at SVL = 512. Dramatically higher throughput than SVE vector ops.
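The outer-product semantics are simple to state in scalar C — a scaled-down model with a small fixed tile (`TILE` stands in for SVL/32):

```c
#include <assert.h>

#define TILE 4  /* stands in for SVL/32 in this scaled-down model */

/* Scalar model of SME FMOPA: the ZA tile accumulates the outer
 * product of two vectors, za[i][j] += x[i] * y[j]. One real
 * instruction performs all TILE*TILE MACs at once. */
static void fmopa_model(float za[TILE][TILE],
                        const float x[TILE], const float y[TILE])
{
    for (int i = 0; i < TILE; i++)
        for (int j = 0; j < TILE; j++)
            za[i][j] += x[i] * y[j];
}
```

A matmul becomes a loop over k of these rank-1 updates: `ZA += A[:,k] * B[k,:]`, which is exactly how SME GEMM kernels are structured.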
| Aspect | NEON | SVE / SVE2 | SME |
|---|---|---|---|
| Width | 128-bit fixed | 128-2048 bit, scalable | SVL × SVL tile |
| Predication | No (mask+blend) | Yes (16 P-regs) | Yes + 2-D |
| Tail handling | Scalar epilogue | WHILELT auto-mask | Tile slice masks |
| Gather/scatter | TBL only | Native | Multi-vector load |
| Best for | Media, codecs, crypto | HPC, BLAS, ML inference | GEMM / LLM |
| First shipped | Cortex-A8 (2009 devices) | A64FX (2020), Cortex-X2 (2022) | Apple M4 (2024) |
| Mandatory? | v8.0-A | v9-A (SVE2) | v9.2-A optional |
In practice: code targeting mobile + server should prefer SVE2 going forward. NEON intrinsics remain for legacy + codec libraries. SME is for GEMM-heavy kernels (LLM inference, cblas_sgemm).
// INT8 GEMM micro-kernel (NEON)
// A: 4 x K, B: K x 16, C: 4 x 16 INT32 in v16-v31
.loop:
ld1 {v0.16b}, [x1], #16 // A: 4 rows x 4 k-steps
ld1 {v4.16b-v7.16b}, [x2], #64 // B: 4 k-steps x 16 cols
sdot v16.4s, v4.16b, v0.4b[0] // C[0, 0:4)
sdot v17.4s, v4.16b, v0.4b[1] // C[1, 0:4)
sdot v18.4s, v4.16b, v0.4b[2] // C[2, 0:4)
sdot v19.4s, v4.16b, v0.4b[3] // C[3, 0:4)
... // 16 sdots total (v4-v7 against v0 lanes)
subs w3, w3, #4 // K -= 4
bne .loop
// 16 sdots × 16 MACs each = 256 MACs / iter
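A scalar reference for what the micro-kernel computes — useful for validating an intrinsics or asm port (the function name is mine):

```c
#include <stdint.h>
#include <assert.h>

/* Scalar reference for the 4x16 INT8 micro-kernel:
 * C (4x16 INT32) += A (4xK INT8, row-major) x B (Kx16 INT8, row-major). */
static void gemm_s8_ref(int32_t c[4][16], const int8_t *a,
                        const int8_t *b, int k)
{
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 16; j++)
            for (int p = 0; p < k; p++)
                c[i][j] += (int32_t)a[i*k + p] * (int32_t)b[p*16 + j];
}
```

Each SDOT in the kernel covers a (1 row) × (4 k-steps) × (4 cols) slice of this triple loop.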
| Instruction(s) | Feature | Purpose |
|---|---|---|
| AESE / AESD / AESMC / AESIMC | AES (FEAT_AES) | One-round AES encrypt / decrypt / mix-columns |
| PMULL / PMULL2 | PMULL (FEAT_PMULL) | 64 × 64 → 128-bit carry-less multiply — GCM, CRC-32 |
| SHA1C / SHA1P / SHA1M / SHA1H | SHA1 (FEAT_SHA1) | SHA-1 hash rounds (legacy) |
| SHA256H / SHA256H2 / SHA256SU0/1 | SHA256 (FEAT_SHA256) | SHA-256 rounds |
| SHA512H / SHA512H2 / SHA512SU0/1 | SHA512 (FEAT_SHA512) | SHA-512 rounds |
| SM3*, SM4* | SM3 / SM4 (FEAT_SM3, FEAT_SM4) | Chinese national hash + cipher |
| EOR3 / RAX1 / XAR / BCAX | SHA3 | Keccak / SHA-3 primitives |
TLS, AES-GCM, and file-system encryption all run at >10 GB/s on modern Cortex-A flagships because of these instructions. AES that costs >20 cycles/byte in scalar code drops below 0.5 cycles/byte with AES + PMULL.
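What PMULL actually computes is a carry-less (polynomial) multiply over GF(2) — a scalar model of the low 64 bits, the primitive behind GHASH and table-free CRC:

```c
#include <stdint.h>
#include <assert.h>

/* Scalar model of PMULL's carry-less multiply: treat a and b as
 * polynomials over GF(2) and return the low 64 bits of their
 * product. Addition is XOR, so no carries ever propagate. */
static uint64_t clmul_lo(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 64; i++)
        if ((b >> i) & 1)
            r ^= a << i;   /* XOR, not add: carry-less */
    return r;
}
```

For example, `clmul_lo(3, 3)` is (x+1)² = x²+1 = 5, not 9 — the cross terms cancel because 1+1 = 0 in GF(2). PMULL does this in one cycle-ish instead of a 64-iteration loop.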
// SVE ACLE intrinsics — daxpy (VLA)
#include <arm_sve.h>
#include <stddef.h>

void daxpy(double *y, const double *x, double a, size_t n) {
    svfloat64_t va = svdup_f64(a);
    size_t i = 0;
    svbool_t pg = svwhilelt_b64(i, n);
    while (svptest_first(svptrue_b64(), pg)) {
        svfloat64_t vx = svld1_f64(pg, x + i);
        svfloat64_t vy = svld1_f64(pg, y + i);
        vy = svmad_f64_m(pg, vx, va, vy);  // y += a*x (masked)
        svst1_f64(pg, y + i, vy);
        i += svcntd();                     // i += 64-bit lanes per vector
        pg = svwhilelt_b64(i, n);
    }
}
Apple has shipped AMX, an undocumented matrix coprocessor, since the A13 / M1 era. SME is the Arm-architectural equivalent; with M4 (2024) Apple exposes the same unit architecturally via SME2, completing the convergence.
SME state interacts with MTE: SME store instructions must respect tag checks. SVE gather/scatter similarly — each lane's address is tag-checked independently.
Arm Ltd. — DDI 0487 — canonical spec for NEON, SVE, SVE2, SME
Arm Ltd. — Arm C Language Extensions (ACLE) — intrinsic reference
Arm Ltd. — SVE Programming Guide (Arm 100891), freely downloadable
Stephens, Biles, Boettcher et al. — "The Arm Scalable Vector Extension" (IEEE Micro, 2017)
Fujitsu Ltd. — A64FX Microarchitecture Manual — first SVE implementation reference
AnandTech — Andrei Frumusanu reviews of X2/A710 SVE2 throughput (2022)
Dougall Johnson — dougallj.wordpress.com — Apple AMX reverse-engineering pieces (for SME context)
Chipsandcheese — deep-dives on Cortex-X and Neoverse V-series FP/SIMD back-ends
Presentation built with Reveal.js 4.6 · Playfair Display + DM Sans + JetBrains Mono
Educational use.