From AES S-Boxes to Post-Quantum NTT Engines —
Architecture, Side-Channel Defense & Silicon Implementation
ASIC · FPGA · SoC Integration · Side-Channel
Software executes sequentially on general-purpose ALUs. Cryptographic algorithms expose massive data-level parallelism and fixed dataflow — ideal for dedicated circuits.
AES in software (OpenSSL, Xeon): ~3–6 Gbps
AES in 45 nm ASIC: 53 Gbps
AES fully-pipelined FPGA: 200+ Gbps
Dedicated datapaths avoid instruction fetch/decode overhead. A compact AES-128 core in 65 nm uses <50 μW — enabling smart-card and IoT deployment.
Dedicated hardware gives cycle-exact constant-time execution and supports gate-level masking and dual-rail logic. These countermeasures are costly to guarantee in software, and dual-rail logic has no software equivalent.
Block ciphers (AES, SM4), stream ciphers (ChaCha20), hash functions (SHA-2, SHA-3/Keccak), authenticated encryption (AES-GCM, AES-CCM).
High throughput · Fixed datapath
RSA (modular exponentiation), ECC (scalar multiplication on P-256, Curve25519), Diffie-Hellman. Dominated by big-number arithmetic.
Compute-intensive · Variable latency
ML-KEM (Kyber), ML-DSA (Dilithium), FALCON, SLH-DSA (SPHINCS+). NTT polynomial multipliers, Keccak-based sampling, rejection loops.
NTT-centric · Emerging standard
TLS/SSL handshake, IPsec/MACsec inline encryption, disk encryption (XTS), HDCP content protection. Full protocol-layer integration.
System-level · Line-rate
The most widely deployed crypto accelerator in silicon
Each AES round applies up to four transformations (SubBytes, ShiftRows, MixColumns, AddRoundKey) to a 128-bit state matrix; the final round omits MixColumns. Hardware maps each to a combinational block.
Single round circuit, reused 10×. Minimum area (~2,400 GE for 8-bit datapath). Throughput: 6–94 Mbps.
IoT · Smart cards
2–5 rounds instantiated. Area/speed trade-off. Throughput scales linearly with unroll factor.
SoC · Balanced
All 10 rounds in pipeline. New block every clock cycle. Throughput: 53–200+ Gbps depending on frequency and datapath width.
Data center · Line-rate
SubBytes dominates AES area and critical path. The S-Box computes the multiplicative inverse in GF(2⁸) followed by an affine transformation.
| Method | Area (GE) | Delay (ns) | Side-Channel | Notes |
|---|---|---|---|---|
| Lookup Table (ROM) | ~4,000 | ~1.0 | Vulnerable | 256×8 ROM; cache-timing leaks in SW |
| Composite Field GF(((2²)²)²) | ~250–350 | ~2.5 | Moderate | Tower decomposition; Canright (2005) |
| Boyar-Peralta Logic Min. | ~113 gates | ~3.0 | Moderate | Boolean minimization heuristics |
| Masked (2-share TI) | ~6,000+ | ~4.0 | Resistant | Threshold implementation, 1st-order DPA safe |
| Redundant GF + Offset | ~200 | ~2.8 | Moderate | Polynomial ring representation; CHES 2024 |
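The S-Box definition above (inversion in GF(2⁸), then an affine map) can be written as a short software golden model, which is a common way to check any of the hardware variants in the table. The sketch below assumes the standard AES polynomial x⁸ + x⁴ + x³ + x + 1 and the FIPS 197 affine constant 0x63; the function names are illustrative and do not come from any particular codebase.

```python
def gf256_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:          # reduce as soon as the degree reaches 8
            a ^= 0x11B
    return result


def gf256_inv(a: int) -> int:
    """Inverse via a^254 (square-and-multiply); 0 maps to 0 by AES convention."""
    result, base, exponent = 1, a, 254
    while exponent:
        if exponent & 1:
            result = gf256_mul(result, base)
        base = gf256_mul(base, base)
        exponent >>= 1
    return result


def rotl8(x: int, n: int) -> int:
    """Rotate an 8-bit value left by n positions."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def aes_sbox(x: int) -> int:
    """SubBytes for one byte: field inversion followed by the affine transform."""
    inv = gf256_inv(x)
    return inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63


if __name__ == "__main__":
    print(hex(aes_sbox(0x53)))
```

A 256-entry table generated from `aes_sbox` gives the contents of the ROM variant, and the same table can serve as the expected output when simulating a composite-field or masked S-Box.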
Galois/Counter Mode combines AES-CTR encryption with GHASH authentication. The GHASH unit requires a GF(2¹²⁸) multiplier — a second major hardware block alongside AES.
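As a rough software model of the GHASH multiplier, the sketch below follows the bit-serial multiplication used in the GCM specification (NIST SP 800-38D), where blocks are 128-bit integers whose most significant bit is the coefficient of x⁰. The names `gf128_mul` and `ghash` are illustrative; a hardware unit would typically process many bits per cycle rather than one.

```python
# Reduction constant for x^128 + x^7 + x^2 + x + 1 in GCM's bit-reflected order.
R_POLY = 0xE1 << 120


def gf128_mul(x: int, y: int) -> int:
    """Multiply two 128-bit blocks in GF(2^128) using GCM's bit ordering."""
    z = 0
    v = y
    for i in range(128):
        if (x >> (127 - i)) & 1:   # scan x from its leftmost bit
            z ^= v
        if v & 1:                  # multiply v by x, reducing on overflow
            v = (v >> 1) ^ R_POLY
        else:
            v >>= 1
    return z


def ghash(h: int, blocks) -> int:
    """Chain Y_i = (Y_{i-1} XOR X_i) * H over a sequence of 128-bit blocks."""
    y = 0
    for block in blocks:
        y = gf128_mul(y ^ block, h)
    return y


if __name__ == "__main__":
    one = 1 << 127                 # the field element "1" in GCM's convention
    print(hex(gf128_mul(one, 0x0123456789ABCDEF0123456789ABCDEF)))
```

The loop structure shows why GHASH sits on the critical path next to AES: each block depends on the previous product, so throughput is set by how fast one multiplication completes.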
Keccak-f[1600] operates on a 1600-bit state (5×5×64) through 24 rounds, each applying the step mappings θ, ρ, π, χ and ι. Its structure maps naturally to hardware: every step is built from XOR, AND, NOT and fixed rotations, so no S-Box lookup tables are needed.
Unrolled architecture with optimized RC generator
Single-round iterative for IoT/embedded
4× unroll: 4 rounds/cycle, 6 cycles per Keccak-f[1600] permutation
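To make the round structure concrete, here is a compact software model of Keccak-f[1600] and a SHA3-256 wrapper, checked against Python's `hashlib`. It is a sketch for use as a golden model and is organized for readability, not speed; the function names are illustrative. Each of the five step mappings corresponds to one combinational layer in a single-round hardware datapath.

```python
import hashlib

MASK64 = 0xFFFFFFFFFFFFFFFF


def rol64(a: int, n: int) -> int:
    """Rotate a 64-bit lane left by n bits."""
    n %= 64
    return ((a << n) | (a >> (64 - n))) & MASK64


def keccak_f1600(lanes) -> None:
    """Apply the 24-round permutation in place; lanes[x][y] are 64-bit ints."""
    lfsr = 1
    for _ in range(24):
        # theta: column parities mixed into every lane
        c = [lanes[x][0] ^ lanes[x][1] ^ lanes[x][2] ^ lanes[x][3] ^ lanes[x][4]
             for x in range(5)]
        d = [c[(x + 4) % 5] ^ rol64(c[(x + 1) % 5], 1) for x in range(5)]
        for x in range(5):
            for y in range(5):
                lanes[x][y] ^= d[x]
        # rho and pi: fixed rotations and a lane permutation (pure wiring in hardware)
        x, y = 1, 0
        current = lanes[x][y]
        for t in range(24):
            x, y = y, (2 * x + 3 * y) % 5
            current, lanes[x][y] = lanes[x][y], rol64(current, (t + 1) * (t + 2) // 2)
        # chi: the only non-linear step, one AND per bit
        for y in range(5):
            row = [lanes[x][y] for x in range(5)]
            for x in range(5):
                lanes[x][y] = row[x] ^ ((~row[(x + 1) % 5]) & row[(x + 2) % 5])
        # iota: round constant generated by an 8-bit LFSR
        for j in range(7):
            lfsr = ((lfsr << 1) ^ ((lfsr >> 7) * 0x71)) % 256
            if lfsr & 2:
                lanes[0][0] ^= 1 << ((1 << j) - 1)


def sha3_256(msg: bytes) -> bytes:
    """SHA3-256 as a sponge over keccak_f1600 (rate 136 bytes, suffix 0x06)."""
    rate = 136
    padded = bytearray(msg) + b"\x06"
    padded += bytes(-len(padded) % rate)
    padded[-1] |= 0x80
    lanes = [[0] * 5 for _ in range(5)]
    for off in range(0, len(padded), rate):
        block = padded[off:off + rate]
        for i in range(rate // 8):
            lanes[i % 5][i // 5] ^= int.from_bytes(block[8 * i:8 * i + 8], "little")
        keccak_f1600(lanes)
    return b"".join(lanes[i % 5][i // 5].to_bytes(8, "little") for i in range(4))


if __name__ == "__main__":
    assert sha3_256(b"abc") == hashlib.sha3_256(b"abc").digest()
    print(sha3_256(b"abc").hex())
```

In this model an unroll factor of 4 corresponds to evaluating four iterations of the outer loop combinationally per clock cycle.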
RSA modular exponentiation · ECC scalar multiplication
RSA's core operation is modular exponentiation: C = M^e mod n. With a private exponent as long as the modulus, square-and-multiply reduces this to thousands of modular multiplications over 2048–4096 bit operands.
Replaces expensive trial division with shift-and-add operations in "Montgomery domain." All modular multiplications become: MonPro(a,b) = a·b·R⁻¹ mod n
Bit-serial: 1 bit/cycle · 2048 cycles · Minimal area
Word-serial (radix 2^w): w bits/cycle · 2048/w cycles · DSP-friendly
Systolic array: Pipelined PEs · Continuous flow · High throughput
Fully parallel: Single cycle · Massive area · Maximum speed
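The bit-serial variant is the easiest to model in software. The sketch below assumes an odd modulus n, operands below n, and R = 2^k with k the bit length of n; `mont_mul` and `mont_pow` are illustrative names. One loop iteration corresponds to one clock cycle of a 1 bit/cycle datapath: a conditional add of b, a conditional add of n to make the sum even, and a right shift.

```python
def mont_mul(a: int, b: int, n: int, k: int) -> int:
    """Return a * b * 2^(-k) mod n using only add, parity test and shift."""
    s = 0
    for i in range(k):
        if (a >> i) & 1:       # add b when bit i of a is set
            s += b
        if s & 1:              # make s even so the shift is an exact division
            s += n
        s >>= 1
    return s - n if s >= n else s   # single final conditional subtraction


def mont_pow(m: int, e: int, n: int) -> int:
    """Left-to-right square-and-multiply in the Montgomery domain."""
    k = n.bit_length()
    r2 = pow(2, 2 * k, n)                 # R^2 mod n, precomputed per modulus
    m_bar = mont_mul(m % n, r2, n, k)     # m * R mod n
    acc = mont_mul(1, r2, n, k)           # R mod n, i.e. "1" in Montgomery form
    for bit in bin(e)[2:]:
        acc = mont_mul(acc, acc, n, k)
        if bit == "1":
            acc = mont_mul(acc, m_bar, n, k)
    return mont_mul(acc, 1, n, k)         # leave the Montgomery domain


if __name__ == "__main__":
    n = (2**61 - 1) * (2**31 - 1)
    print(mont_pow(123456789, 65537, n) == pow(123456789, 65537, n))
```

The exponentiation loop here branches on exponent bits and is therefore not constant-time; it only illustrates how every multiplication becomes a MonPro call. A word-serial design processes w bits of `a` per iteration instead of one.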
ECC computes Q = k·P via repeated point addition and doubling over a 256-bit prime field. The hierarchy of operations creates a natural hardware decomposition.
Edwards25519 hardware: unified point add/double in 646 cycles, full scalar multiplication in 164,730 cycles at 117 MHz → 1.4 ms per operation on Virtex-5.
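The operation hierarchy (field arithmetic, then point addition and doubling, then scalar multiplication) can be illustrated on a toy curve. The sketch below assumes a hypothetical short Weierstrass curve y² = x³ + 2x + 3 over F₉₇, chosen only so the numbers stay small; it is not a standardized curve, and the function names are illustrative. The ladder performs one addition and one doubling per key bit regardless of the bit value, which is the operation schedule a constant-time hardware controller would follow.

```python
P_MOD, A, B = 97, 2, 3      # toy curve y^2 = x^3 + 2x + 3 over F_97 (illustrative only)
INF = None                  # point at infinity


def point_add(p, q):
    """Affine point addition, including doubling and the special cases."""
    if p is INF:
        return q
    if q is INF:
        return p
    x1, y1 = p
    x2, y2 = q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return INF                                   # P + (-P)
    if p == q:
        lam = (3 * x1 * x1 + A) * pow((2 * y1) % P_MOD, -1, P_MOD) % P_MOD
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    y3 = (lam * (x1 - x3) - y1) % P_MOD
    return (x3, y3)


def scalar_mul_ladder(k: int, p, bits: int = 8):
    """Montgomery ladder: one add and one double per scalar bit."""
    r0, r1 = INF, p
    for i in reversed(range(bits)):
        if (k >> i) & 1:
            r0, r1 = point_add(r0, r1), point_add(r1, r1)
        else:
            r0, r1 = point_add(r0, r0), point_add(r0, r1)
    return r0


if __name__ == "__main__":
    G = (3, 6)              # satisfies 6^2 = 3^3 + 2*3 + 3 (mod 97)
    print(scalar_mul_ladder(5, G))
```

Real designs replace the affine formulas with projective coordinates to avoid a field inversion per point operation, and the Python branches here are of course not timing-safe themselves.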
ML-KEM · ML-DSA · FALCON · NTT acceleration
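NIST finalized its first three post-quantum standards in August 2024: the lattice-based ML-KEM and ML-DSA, and the hash-based SLH-DSA. FALCON, also lattice-based, is slated for FIPS 206. The lattice schemes share a common computational bottleneck: polynomial multiplication, accelerated via the Number Theoretic Transform (NTT).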
| Standard | Algorithm | Type | Key Operation | Hardware Bottleneck |
|---|---|---|---|---|
| FIPS 203 | ML-KEM (Kyber) | KEM | Polynomial mul (mod q=3329) | NTT + SHA-3/SHAKE |
| FIPS 204 | ML-DSA (Dilithium) | Signature | Polynomial mul (mod q=8380417) | NTT + SHA-3/SHAKE |
| FIPS 205 | SLH-DSA (SPHINCS+) | Signature | Hash tree traversal | SHA-2/SHAKE throughput |
| Draft 206 | FALCON | Signature | FFT Sampling (floating-pt) | Fast Fourier Sampling + NTT |
NTT is the finite-field equivalent of FFT. It converts O(n²) polynomial multiplication into O(n log n) operations modulo a prime q.
Each butterfly stage computes:
Cooley-Tukey (CT): a' = a + ω·b (mod q)
b' = a - ω·b (mod q)
Gentleman-Sande (GS): a' = a + b (mod q)
b' = (a - b)·ω (mod q)
Requires: one modular multiplication + two modular add/sub per butterfly. Montgomery reduction eliminates the expensive division in modular multiplication.
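The Cooley-Tukey butterfly above can be exercised end to end with a small software model. The sketch below is a simplification: it computes a plain cyclic NTT of length 256 modulo q = 3329 with ω = 17 and uses it for multiplication modulo x²⁵⁶ − 1. ML-KEM itself works modulo x²⁵⁶ + 1 with a 7-layer incomplete transform, so this is an illustration of the butterfly network, not the standard's exact transform. All function names are illustrative.

```python
Q = 3329
N = 256
OMEGA = 17      # a primitive 256th root of unity modulo 3329


def bit_reverse(i: int, bits: int) -> int:
    """Reverse the low `bits` bits of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def ntt(coeffs, omega: int):
    """Iterative Cooley-Tukey NTT: bit-reversed input order, natural output order."""
    n = len(coeffs)
    bits = n.bit_length() - 1
    a = [coeffs[bit_reverse(i, bits)] for i in range(n)]
    length = 2
    while length <= n:                       # one pass per butterfly stage
        half = length // 2
        w_step = pow(omega, n // length, Q)
        for start in range(0, n, length):
            w = 1
            for j in range(half):
                u = a[start + j]
                v = a[start + j + half] * w % Q       # omega * b
                a[start + j] = (u + v) % Q            # a' = a + omega*b
                a[start + j + half] = (u - v) % Q     # b' = a - omega*b
                w = w * w_step % Q
        length *= 2
    return a


def intt(values, omega: int):
    """Inverse transform: NTT with omega^-1, then scale by n^-1."""
    n_inv = pow(len(values), -1, Q)
    return [x * n_inv % Q for x in ntt(values, pow(omega, -1, Q))]


def cyclic_mul(f, g):
    """Multiply modulo x^N - 1 with O(N log N) butterflies."""
    fh, gh = ntt(f, OMEGA), ntt(g, OMEGA)
    return intt([x * y % Q for x, y in zip(fh, gh)], OMEGA)


def schoolbook_cyclic(f, g):
    """O(N^2) reference for the same product."""
    out = [0] * len(f)
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            out[(i + j) % len(f)] += fi * gj
    return [x % Q for x in out]


if __name__ == "__main__":
    f = [(7 * i + 3) % Q for i in range(N)]
    g = [(i * i + 1) % Q for i in range(N)]
    print(cyclic_mul(f, g) == schoolbook_cyclic(f, g))
```

Each iteration of the innermost loop is one butterfly, so a full transform of length n with log₂ n stages performs (n/2)·log₂ n butterflies; hardware architectures differ mainly in how many of these run per cycle.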
ML-KEM (Kyber): n=256, q=3329 · 7 butterfly stages · 12-bit coefficients · 224 cycles (pipelined)
ML-DSA (Dilithium): n=256, q=8380417 · 8 butterfly stages · 23-bit coefficients · 512 cycles (pipelined)
FALCON: n=512/1024, q=12289 · 9–10 stages · 14-bit coefficients · Also needs FFT sampling
Single butterfly unit processes coefficients sequentially from RAM. Reuses one BF across all stages.
Area: Minimal (~800 LUTs)
Cycles: (n/2)·log₂n (one butterfly per cycle)
Use: Compact IoT
Streaming pipeline with delay lines between stages. Conflict-free memory access. Data flows continuously.
Area: log₂n BF units
Cycles: ~n (pipelined)
Use: High throughput
Feedback architecture reuses butterfly across radix stages. Pipelined with delay register banks.
Area: Moderate
Cycles: ~n + pipeline depth
Use: Area-efficient pipeline
Processes 4 coefficients per cycle. Halves the number of stages. Optimal for high-performance designs.
Area: 4× butterfly area
Cycles: n/4 per stage
Use: Maximum throughput
Real deployments integrate NTT engines into RISC-V or ARM SoCs via memory-mapped registers or custom instruction set extensions (ISEs).
Architecture based on CRYSTALS-Dilithium post-quantum SoC designs for wired-communication critical systems.
The hardware security dimension
Algorithmically secure crypto can be broken by observing physical leakage during computation. Hardware designers must defend against all channels simultaneously.
SPA — Single trace, visual inspection of operations
DPA — Statistical correlation across many traces
CPA — Correlation with hypothetical power models
HO-DPA — Combines multiple time points to defeat masking
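A small simulation shows how CPA turns a power model into a key guess. Everything in the sketch below is an assumption made for illustration: the target is a 4-bit S-Box lookup (using the PRESENT S-Box values), the device is assumed to leak the Hamming weight of the S-Box output plus Gaussian noise, and the traces are synthetic. The attacker correlates the measured samples with the predicted Hamming weight for each of the 16 key guesses and keeps the best match.

```python
import random

# 4-bit S-Box used as the attacked operation (values of the PRESENT S-Box).
SBOX4 = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
         0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]


def hamming_weight(x: int) -> int:
    return bin(x).count("1")


def simulate_traces(key: int, repeats: int = 50, sigma: float = 0.5, seed: int = 1):
    """Synthetic single-sample traces: HW(SBOX4[p ^ key]) plus Gaussian noise."""
    rng = random.Random(seed)
    plaintexts = [p for _ in range(repeats) for p in range(16)]
    traces = [hamming_weight(SBOX4[p ^ key]) + rng.gauss(0.0, sigma)
              for p in plaintexts]
    return plaintexts, traces


def pearson(xs, ys) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def cpa_recover(plaintexts, traces) -> int:
    """Return the key guess whose power model correlates best with the traces."""
    scores = []
    for guess in range(16):
        model = [hamming_weight(SBOX4[p ^ guess]) for p in plaintexts]
        scores.append(pearson(model, traces))
    return max(range(16), key=lambda g: scores[g])


if __name__ == "__main__":
    pts, trs = simulate_traces(key=0xB)
    print(hex(cpa_recover(pts, trs)))
```

Masking attacks exactly this step: if the S-Box output is split into random shares, the Hamming weight of any single intermediate value no longer correlates with the key-dependent prediction.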
SEMA — Simple EM analysis (like SPA)
DEMA — Differential EM analysis
EM probes can isolate individual circuit blocks — more spatially precise than power analysis.
Variable execution time leaks secret-dependent branches. Cache-timing attacks exploit data-dependent memory access patterns (e.g., AES T-table lookups).
Voltage glitching, clock glitching, laser fault injection. Induces computational errors that reveal key bits via differential fault analysis (DFA).
| Technique | Defends Against | Overhead | Effectiveness |
|---|---|---|---|
| Boolean Masking (d shares) | DPA up to order d−1 | ~d² area | Provable in probing model |
| Threshold Implementation (TI) | 1st/2nd order DPA | 3–5× area | Glitch-resistant; proven |
| Dual-Rail / WDDL Logic | SPA, DPA | 2× area, 2× power | Constant power per op |
| Random Delays / Shuffling | SPA, DPA (reduces SNR) | TRNG + control | Raises attack complexity |
| Constant-Time Design | Timing attacks | ~0% area | Eliminates timing channel |
| Blinding (Asymmetric) | DPA on RSA/ECC | 1 extra mul | Randomizes intermediate values |
| EM Shielding Mesh | EM probing | Top metal layers | Physical barrier |
| Voltage/Clock Monitors | Fault injection | Sensor area | Detects glitch attempts |
TI splits every sensitive variable into d+1 shares that are processed independently. Even if an attacker observes glitch-induced transient leakage, no single share reveals information.
Correctness
Combining shares recovers the correct output: y = y₁ ⊕ y₂ ⊕ ... ⊕ y_{d+1}
Non-Completeness
Each share function f_i is independent of at least one input share, so glitches inside any single function cannot combine all shares of a secret
Uniformity
Output sharing is uniformly distributed — no bias that could leak information
// 2-share TI for AES S-Box (simplified concept)
// Each share never sees the complete unmasked value
module ti_sbox (
input [7:0] x_share1, x_share2, // Masked input shares
input [7:0] fresh_random, // Fresh randomness per clock
output [7:0] y_share1, y_share2 // Masked output shares
);
// Non-linear layer uses cross-domain terms with fresh masks
// Linear layers (affine transform) applied to each share independently
// Area: ~6,000 GE vs ~350 GE unmasked — ~17× overhead
endmodule
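The correctness and non-completeness properties can be checked exhaustively on the smallest non-linear function, a single AND gate. The sketch below uses the classic 3-share first-order sharing from the threshold implementation literature rather than the 2-share variant named in the stub above, because the 3-share form needs no fresh randomness to state. The function names are illustrative.

```python
from itertools import product


def ti_and(x, y):
    """3-share TI of z = x & y; each output share omits one input share index."""
    x1, x2, x3 = x
    y1, y2, y3 = y
    z1 = (x2 & y2) ^ (x2 & y3) ^ (x3 & y2)    # never touches share 1
    z2 = (x3 & y3) ^ (x1 & y3) ^ (x3 & y1)    # never touches share 2
    z3 = (x1 & y1) ^ (x1 & y2) ^ (x2 & y1)    # never touches share 3
    return z1, z2, z3


def check_ti_and() -> bool:
    """Exhaustively verify correctness and non-completeness over all sharings."""
    for bits in product((0, 1), repeat=6):
        x, y = bits[:3], bits[3:]
        z = ti_and(x, y)
        # Correctness: recombined output equals AND of the recombined inputs.
        if z[0] ^ z[1] ^ z[2] != (x[0] ^ x[1] ^ x[2]) & (y[0] ^ y[1] ^ y[2]):
            return False
        # Non-completeness: flipping share i of both inputs leaves z_i unchanged.
        for i in range(3):
            xf = tuple(b ^ (j == i) for j, b in enumerate(x))
            yf = tuple(b ^ (j == i) for j, b in enumerate(y))
            if ti_and(xf, yf)[i] != z[i]:
                return False
    return True


if __name__ == "__main__":
    print(check_ti_and())
```

This particular sharing is known not to satisfy uniformity on its own, which is why practical S-Box designs add fresh random bits or re-mask between non-linear stages.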
ASIC · FPGA · SoC · Instruction Set Extensions
| Criterion | ASIC | FPGA |
|---|---|---|
| Performance | Maximum frequency, custom gates; 45 nm AES: 53 Gbps | Routing delays limit fmax; Kintex US+: 206 Gbps (multi-core) |
| Area/Cost | Smallest silicon per function; high NRE ($1M+ for 28 nm tapeout) | ~10–50× more silicon per function; low NRE, higher per-unit cost |
| Power | Lowest, no configuration overhead; 65 nm AES: <50 μW achievable | Higher static + dynamic power; always-on configuration SRAM |
| Flexibility | Fixed at fabrication; bugs require a respin | Reconfigurable post-deployment; algorithm agility for PQC migration |
| Time-to-Market | 12–24 months | Weeks to months |
| Side-Channel | Custom logic styles possible; full control over layout | Limited routing control, but partial bitstream patching helps |
| Best For | High-volume: smart cards, SoCs, data-center NICs, HSMs | Prototyping, low-volume, crypto-agile systems, cloud FPGAs |
Rather than a fully separate accelerator, ISEs add crypto-specific instructions to the CPU pipeline. They offer a middle ground between pure software and dedicated hardware.
x86: AES-NI (single-round AES in one instruction), PCLMULQDQ (carryless multiply for GHASH), SHA-NI (SHA-1/SHA-256 rounds), SM3/SM4 (Chinese national algorithms)
ARM AArch64: AESE, AESMC, SHA256H
RISC-V: aes32esi / aes64es (AES encrypt step), sha256sig0 (SHA-256 sigma), sm4ed (SM4 encrypt/decrypt), clmul (carryless multiply)
Custom: RANTT (NTT for PQC)
TLS · IPsec · MACsec · Disk Encryption
Modern network SoCs embed crypto engines directly in the data path. The CPU never touches plaintext packets — the engine encrypts/decrypts at wire speed.
IPsec: Security Association (SA) database in on-chip TCAM. Supports 10K+ concurrent tunnels. ESP encrypt + HMAC at 100+ Gbps.
MACsec: Layer-2 hop-by-hop encryption. AES-GCM-256 at line rate. Critical for industrial networks and data-center fabric.
TLS: Asymmetric (RSA/ECC handshake) + symmetric (AES-GCM record layer). SmartNIC designs offload both from host CPU.
SystemVerilog implementations of key building blocks
module aes_round (
input logic [127:0] state_in,
input logic [127:0] round_key,
input logic is_last_round, // Skip MixColumns on round 10
output logic [127:0] state_out
);
logic [127:0] after_sub, after_shift, after_mix;
// SubBytes: 16 parallel S-Box instances (composite field)
genvar i;
generate
for (i = 0; i < 16; i++) begin : gen_sbox
aes_sbox_cf u_sbox (
.in (state_in[8*i +: 8]),
.out (after_sub[8*i +: 8])
);
end
endgenerate
// ShiftRows: fixed byte permutation (zero-cost in hardware — just wiring)
assign after_shift = {
after_sub[127:120], after_sub[ 87: 80], after_sub[ 47: 40], after_sub[ 7: 0],
after_sub[ 95: 88], after_sub[ 55: 48], after_sub[ 15: 8], after_sub[103: 96],
after_sub[ 63: 56], after_sub[ 23: 16], after_sub[111:104], after_sub[ 71: 64],
after_sub[ 31: 24], after_sub[119:112], after_sub[ 79: 72], after_sub[ 39: 32]
};
// MixColumns: GF(2^8) matrix multiply per column
generate
for (i = 0; i < 4; i++) begin : gen_mixcol
aes_mixcolumn u_mix (
.in (after_shift[32*i +: 32]),
.out (after_mix[32*i +: 32])
);
end
endgenerate
// AddRoundKey: 128-bit XOR
assign state_out = (is_last_round ? after_shift : after_mix) ^ round_key;
endmodule
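A byte-level reference for the two linear steps is handy when writing a testbench for the round module above. The sketch below assumes the same byte order as the ShiftRows wiring in that module: state byte k occupies bits [127−8k : 120−8k], stored column by column, so byte index 4c + r holds row r of column c. The function names are illustrative.

```python
def xtime(b: int) -> int:
    """Multiply a byte by x (i.e. by 2) in GF(2^8) modulo 0x11B."""
    b <<= 1
    if b & 0x100:
        b ^= 0x11B
    return b


def shift_rows(state):
    """Rotate row r left by r positions; state is 16 bytes in column-major order."""
    return [state[4 * ((c + r) % 4) + r] for c in range(4) for r in range(4)]


def mix_column(col):
    """Multiply one 4-byte column by the fixed AES matrix [2 3 1 1; 1 2 3 1; ...]."""
    a0, a1, a2, a3 = col
    return [
        xtime(a0) ^ (xtime(a1) ^ a1) ^ a2 ^ a3,
        a0 ^ xtime(a1) ^ (xtime(a2) ^ a2) ^ a3,
        a0 ^ a1 ^ xtime(a2) ^ (xtime(a3) ^ a3),
        (xtime(a0) ^ a0) ^ a1 ^ a2 ^ xtime(a3),
    ]


def mix_columns(state):
    """Apply mix_column to each of the four columns of a 16-byte state."""
    out = []
    for c in range(4):
        out.extend(mix_column(state[4 * c:4 * c + 4]))
    return out


if __name__ == "__main__":
    print([hex(b) for b in mix_column([0xDB, 0x13, 0x53, 0x45])])
```

Since `xtime` is one conditional XOR after a shift, each MixColumns output byte costs only a handful of XOR gates, which is why SubBytes rather than MixColumns dominates the round's area.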
// AES S-Box via tower field: GF(2^8) → GF((2^4)^2) → GF(((2^2)^2)^2)
// No ROM — pure combinational logic (~300 gates)
module aes_sbox_cf (
input logic [7:0] in,
output logic [7:0] out
);
logic [7:0] mapped, inv_mapped;
logic [3:0] hi, lo, inv_hi, inv_lo;
logic [3:0] sum_hl, sq_scale, prod, inv_d, t;
// Step 1: Isomorphic mapping GF(2^8) → GF((2^4)^2)
assign mapped = iso_map(in); // 8×8 binary matrix multiply
assign hi = mapped[7:4];
assign lo = mapped[3:0];
// Step 2: Compute inverse in GF((2^4)^2)
assign sum_hl = hi ^ lo; // GF(2^4) add
assign sq_scale = gf16_sq_scale(hi); // hi² · λ (combined)
assign prod = gf16_mul(sum_hl, lo); // (hi⊕lo) · lo
assign t = sq_scale ^ prod; // Norm: hi²λ ⊕ (hi⊕lo)·lo
assign inv_d = gf16_inv(t); // Inversion in GF(2^4)
// For hi·x + lo modulo x² + x + λ, the inverse is (hi·d⁻¹)·x + (hi⊕lo)·d⁻¹
assign inv_hi = gf16_mul(hi, inv_d); // hi · inv_d
assign inv_lo = gf16_mul(sum_hl, inv_d); // (hi⊕lo) · inv_d
// Step 3: Inverse map + affine transformation
assign inv_mapped = inv_iso_map({inv_hi, inv_lo});
assign out = affine_transform(inv_mapped);
endmodule
Each gf16_* function decomposes further into GF(2²) arithmetic — ultimately reducing to AND/XOR gates. The deepest path is ~22 XOR + ~12 AND gate delays.
// Cooley-Tukey NTT butterfly for ML-KEM (Kyber)
// q = 3329, coefficients are 12-bit
module ntt_butterfly #(
parameter Q = 3329,
parameter WIDTH = 12
)(
input logic clk, rst_n,
input logic [WIDTH-1:0] a_in, b_in, // Input coefficients
input logic [WIDTH-1:0] omega, // Twiddle factor (root of unity)
output logic [WIDTH-1:0] a_out, b_out // Output coefficients
);
logic [2*WIDTH-1:0] prod;
logic [WIDTH-1:0] omega_b, sum_val, diff_val;
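// Modular multiplication: ω · b (mod q) via Montgomery
// omega is expected pre-scaled to Montgomery form (ω·R mod q, R = 2^16),
// so the R^(-1) factor of the reduction cancels and omega_b = ω·b mod q
assign prod = omega * b_in; // Full-width multiply
// Montgomery reduction: t = prod · (-q^(-1)) mod R; result = (prod + t·q) / R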
montgomery_reduce #(.Q(Q), .W(WIDTH)) u_mont (
.product (prod),
.result (omega_b)
);
// Butterfly: a' = a + ω·b (mod q), b' = a - ω·b (mod q)
mod_add_sub #(.Q(Q), .W(WIDTH)) u_add (
.a(a_in), .b(omega_b), .sum(sum_val), .diff(diff_val)
);
// Pipeline register
always_ff @(posedge clk or negedge rst_n)
if (!rst_n) begin a_out <= '0; b_out <= '0; end
else begin a_out <= sum_val; b_out <= diff_val; end
endmodule
// Montgomery reduction for Kyber (q = 3329), R = 2^16
// q = 13·256 + 1, so both constant multiplies can be built from shifts and adds
module montgomery_reduce #(
parameter Q = 3329,
parameter Q_INV_NEG = 3327, // -q^(-1) mod 2^16 (3329 · 3327 ≡ -1 mod 2^16)
parameter W = 12
)(
input logic [2*W-1:0] product, // 24-bit input (a × b)
output logic [W-1:0] result // Reduced to [0, q)
);
logic [15:0] t;
logic [31:0] u;
logic [15:0] r;
// Step 1: t = (product mod R) × (-q^(-1)) mod R
// where R = 2^16
assign t = 16'(product[15:0] * 16'(Q_INV_NEG)); // Low 16 bits only
// Step 2: u = product + t × Q; the low 16 bits of u are zero by construction
assign u = 32'(product) + 32'(t) * 32'(Q);
// Step 3: Divide by R, then one conditional subtraction (u >> 16 is below 2q)
assign r = u[31:16];
assign result = (r >= 16'(Q)) ? W'(r - 16'(Q)) : r[W-1:0];
endmodule
Key optimization: because q = 3329 = 2¹¹ + 2¹⁰ + 2⁸ + 1, the constant multiplies inside the reduction (by the inverse constant and by q) decompose into a few shifts and adds, so the reduction step needs no hardware multiplier. The butterfly still needs one multiplier for the ω·b product.
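The shift-and-add decomposition can be checked with a bit-accurate software model. The sketch below assumes R = 2¹⁶ and the negated inverse −q⁻¹ mod 2¹⁶ = 3327, so that the correction term is added rather than subtracted; the helper names are illustrative. Both constants share the pattern 3328 ± 1 = 2¹¹ + 2¹⁰ + 2⁸ ± 1, which is what removes the multipliers.

```python
Q = 3329
R_BITS = 16
NEG_Q_INV = 3327            # -q^(-1) mod 2^16


def mul_3327(x: int) -> int:
    """x * 3327 as three shifts, two adds and one subtract."""
    return (x << 11) + (x << 10) + (x << 8) - x


def mul_3329(x: int) -> int:
    """x * 3329 as three shifts and three adds."""
    return (x << 11) + (x << 10) + (x << 8) + x


def montgomery_reduce(t: int) -> int:
    """Return t * 2^(-16) mod q for 0 <= t < 2^24, without a general multiplier."""
    m = mul_3327(t & 0xFFFF) & 0xFFFF     # m = (t mod R) * (-q^-1) mod R
    u = (t + mul_3329(m)) >> R_BITS       # low 16 bits of the sum are zero
    return u - Q if u >= Q else u         # u < 2q, so one subtraction suffices


if __name__ == "__main__":
    r_inv = pow(1 << R_BITS, -1, Q)
    print(montgomery_reduce(1234567) == 1234567 * r_inv % Q)
```

Because the result carries a factor of R⁻¹, one operand of each product (typically the twiddle factor) is stored pre-multiplied by R mod q so the factor cancels.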
Area · Throughput · Power · Comparisons
| Implementation | Platform | Throughput | Area | Efficiency |
|---|---|---|---|---|
| OpenSSL (SW, Xeon) | x86 + AES-NI | ~6 Gbps | — | CPU-bound |
| 8-bit iterative | 90 nm ASIC | 94 Mbps | 2,645 GE | 35.6 Kbps/GE |
| Composite field 128b | 45 nm ASIC | 53 Gbps | ~100K GE | 530 Kbps/GE |
| Fully pipelined | Artix-7 FPGA | 72 Gbps | ~4K Slices | 18 Mbps/Slice |
| Multi-core (x4) | KU+ FPGA | 742 Gbps | ~16K Slices | 46 Mbps/Slice |
| TI masked (2-share) | 180 nm ASIC | ~1 Gbps | ~40K GE | DPA resistant |
| AES-GCM SoC | Virtex-5 FPGA | 25.6 Gbps | 69% of SoC | 4-parallel engine |
| Design | Platform | Algorithm | NTT Cycles | Area (LUTs) | ATP Improvement |
|---|---|---|---|---|---|
| SDF-NTT (pipelined) | Artix-7 | ML-KEM | 224 | ~4K | 49.6% vs. SoA |
| Unified NTT (4×BF) | Artix-7 | ML-KEM + ML-DSA | 224 / 512 | ~6K | 82% (Kyber) |
| MDC-NTT | Kintex-7 | ML-KEM | 5,551 total | ~12K | 38.9% latency↓ |
| FalconSign | 28 nm ASIC | FALCON-512 | — | 0.71 mm² | 5.1× vs. SoA |
| RISC-V + OTBN ISE | OpenTitan | Dilithium verify | — | RoT SoC | PQC on RoT |
| HLS batch accelerator | Alveo FPGA | ML-KEM + ML-DSA | batch | ~50K | High-volume server |
Post-quantum hardware is rapidly maturing. Unified Kyber+Dilithium accelerators sharing NTT/Keccak datapaths will become the standard building block for PQC-ready SoCs.
PQC migration · FHE acceleration · Crypto agility
The PQC transition requires hardware that can switch algorithms post-deployment. Reconfigurable engines and ISEs provide agility that fixed ASIC logic cannot. Hybrid classical+PQC modes require parallel datapaths.
FHE demands NTT operations on polynomials with millions of coefficients and 100+ bit moduli. FPGA accelerators (FAB) achieve 456× speedup over CPU. Custom ASIC designs target practical bootstrapping times.
FALCON requires floating-point discrete Gaussian sampling — unusual for crypto hardware. ASIC implementations at 28 nm achieve 5.2K signatures/sec with dedicated FFT sampling engines.
ZK-SNARKs/STARKs require massive NTT/MSM (multi-scalar multiplication) computations. GPU and FPGA accelerators race to make real-time ZK proofs practical for blockchain and privacy applications.
FIPS 197 — Advanced Encryption Standard (AES)
FIPS 198 — HMAC
FIPS 202 — SHA-3 (Keccak)
FIPS 203 — ML-KEM (Kyber) — Aug 2024
FIPS 204 — ML-DSA (Dilithium) — Aug 2024
FIPS 205 — SLH-DSA (SPHINCS+) — Aug 2024
FIPS 140-3 — Security Requirements for Crypto Modules
SP 800-186 — Elliptic Curve Recommendations
Canright, "A Very Compact Rijndael S-box" (CHES 2005)
Boyar & Peralta, "New Logic Minimization Techniques" (2010)
Kocher, Jaffe & Jun, "Differential Power Analysis" (CRYPTO 1999)
Koç et al., "Finite Field Arithmetic for Cryptography" (IEEE 2010)
Montgomery, "Modular Multiplication Without Trial Division" (1985)
Mathew et al., "53 Gbps Native GF(2⁴)² Composite-Field AES-Encrypt/Decrypt Accelerator" (JSSC 2011, 45 nm)
Ouyang et al., "FalconSign" (TCHES 2025, 28 nm)
FAB: FPGA Accelerator for Bootstrappable FHE (HPCA 2023)
EMINEM: Mixed-Radix NTT for PQC (ACM TRETS 2025)
OpenTitan: Open-Source Silicon Root of Trust
Rambus DPA Countermeasures & TVLA Methodology
Nikova, Rechberger & Rijmen, "Threshold Implementations" (2006)
Mangard, Oswald & Popp, "Power Analysis Attacks" (Springer)
ChipWhisperer Open-Source SCA Platform
Symmetric engines — AES pipeline architectures from 2.6K gates to 742 Gbps, S-Box design via composite field decomposition, AES-GCM authenticated encryption with GF(2¹²⁸) GHASH.
Asymmetric engines — Montgomery multiplication architectures (bit-serial to systolic), ECC scalar multiplication with projective coordinates, FPGA DSP block exploitation.
Post-quantum engines — NTT butterfly architectures (SDF, MDC, mixed-radix), unified Kyber+Dilithium datapaths, Keccak/SHAKE as shared infrastructure, RISC-V SoC integration.
Side-channel defense — Boolean masking, threshold implementation, dual-rail logic, TVLA validation, fault injection countermeasures.
Implementation — ASIC vs. FPGA trade-offs, ISE extensions (AES-NI, RISC-V Zkn), protocol offload engines, certification requirements.