Modern Cryptography Series · Presentation 10

Crypto Hardware Accelerator Design

From AES S-Boxes to Post-Quantum NTT Engines: Architecture, Side-Channel Defense & Silicon Implementation

ASIC · FPGA · SoC Integration · Side-Channel

Why Crypto in Hardware?

Software executes sequentially on general-purpose ALUs. Cryptographic algorithms expose massive data-level parallelism and fixed dataflow — ideal for dedicated circuits.

⚡ Throughput

AES in software (OpenSSL, Xeon): ~3–6 Gbps
AES in 45 nm ASIC: 53 Gbps
AES fully-pipelined FPGA: 200+ Gbps

🔋 Energy

Dedicated datapaths avoid instruction fetch/decode overhead. A compact AES-128 core in 65 nm uses <50 μW — enabling smart-card and IoT deployment.

🛡️ Security

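Hardware enables constant-time execution, masking, and dual-rail logic. Gate-level countermeasures such as dual-rail logic have no software equivalent, and masking is harder to guarantee in software, where the compiler and microarchitecture can reintroduce leakage.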

Design trade-off space: Every crypto accelerator navigates a five-dimensional trade-off — throughput, area (gate count), power, latency, and side-channel resistance.

Accelerator Taxonomy

Symmetric Engines

Block ciphers (AES, SM4), stream ciphers (ChaCha20), hash functions (SHA-2, SHA-3/Keccak), authenticated encryption (AES-GCM, AES-CCM).

High throughput · Fixed datapath

Asymmetric / PKC Engines

RSA (modular exponentiation), ECC (scalar multiplication on P-256, Curve25519), Diffie-Hellman. Dominated by big-number arithmetic.

Compute-intensive · Variable latency

Post-Quantum Engines

ML-KEM (Kyber), ML-DSA (Dilithium), FALCON, SLH-DSA (SPHINCS+). NTT polynomial multipliers, Keccak-based sampling, rejection loops.

NTT-centric · Emerging standard

Protocol Offload Engines

TLS/SSL handshake, IPsec/MACsec inline encryption, disk encryption (XTS), HDCP content protection. Full protocol-layer integration.

System-level · Line-rate

SECTION I

AES Hardware Architecture

The most widely deployed crypto accelerator in silicon

AES-128 Round Architecture

Each AES round applies four transformations to a 128-bit state matrix (the final round omits MixColumns). Hardware maps each to a combinational block.

SubBytes: 16× S-Box
ShiftRows: Byte permutation
MixColumns: GF(2⁸) matrix
AddRoundKey: 128-bit XOR

Iterative (Compact)

Single round circuit, reused 10×. Minimum area (~2,400 GE for 8-bit datapath). Throughput: 6–94 Mbps.

IoT · Smart cards

Loop-Unrolled

2–5 rounds instantiated. Area/speed trade-off. Throughput scales linearly with unroll factor.

SoC · Balanced

Fully Pipelined

All 10 rounds in pipeline. New block every clock cycle. Throughput: 53–200+ Gbps depending on frequency and datapath width.

Data center · Line-rate
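The three architectures can be compared with a first-order throughput model: block size times clock frequency, divided by cycles per block. This is a rough sketch, not a figure from any cited design; the clock frequencies and the 160-cycle count for the 8-bit core are illustrative assumptions chosen to land in the ranges quoted above.

```python
def aes_throughput_gbps(clock_mhz, cycles_per_block, block_bits=128):
    """First-order estimate in Gbit/s; ignores key setup, I/O stalls and pipeline fill."""
    return block_bits * clock_mhz * 1e6 / cycles_per_block / 1e9

# Hypothetical operating points (all clock values are assumptions).
iterative_8bit = aes_throughput_gbps(clock_mhz=100, cycles_per_block=160)  # compact core
round_per_clk = aes_throughput_gbps(clock_mhz=400, cycles_per_block=10)    # one round per cycle
pipelined = aes_throughput_gbps(clock_mhz=400, cycles_per_block=1)         # one block per cycle

print(f"8-bit iterative : {iterative_8bit * 1000:.0f} Mbps")
print(f"round-per-clock : {round_per_clk:.2f} Gbps")
print(f"fully pipelined : {pipelined:.1f} Gbps")
```

The model makes the trade-off visible: at the same clock, each step from iterative to pipelined buys throughput only by instantiating more round logic.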

S-Box Implementation Strategies

SubBytes dominates AES area and critical path. The S-Box computes the multiplicative inverse in GF(2⁸) followed by an affine transformation.

Method | Area (GE) | Delay (ns) | Side-Channel | Notes
Lookup Table (ROM) | ~4,000 | ~1.0 | Vulnerable | 256×8 ROM; cache-timing leaks in SW
Composite Field GF(((2²)²)²) | ~250–350 | ~2.5 | Moderate | Tower decomposition; Canright (2005)
Boyar-Peralta Logic Min. | ~113 gates | ~3.0 | Moderate | Boolean minimization heuristics
Masked (2-share TI) | ~6,000+ | ~4.0 | Resistant | Threshold implementation, 1st-order DPA safe
Redundant GF + Offset | ~200 | ~2.8 | Moderate | Polynomial ring representation; CHES 2024

Canright Decomposition: Maps GF(2⁸) → GF((2⁴)²) → GF(((2²)²)²). Inversion in the smallest subfield uses only AND/XOR gates — no ROM, no lookup. The 45 nm composite-field AES reaching 53 Gbps uses this approach with 16 parallel S-Boxes.
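A software golden model is handy for checking any of these S-Box circuits bit for bit. The sketch below builds the S-Box directly from its definition (GF(2⁸) inverse followed by the affine map) rather than from a tower-field decomposition; the function names are illustrative and not taken from any cited design.

```python
def gf256_mul(a, b):
    """Multiply in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r

def gf256_inv(a):
    """Inverse as a^254 (Fermat's little theorem); AES maps 0 to 0."""
    if a == 0:
        return 0
    result = 1
    for _ in range(254):
        result = gf256_mul(result, a)
    return result

def affine(x):
    """AES affine map: x ^ rotl(x,1) ^ rotl(x,2) ^ rotl(x,3) ^ rotl(x,4) ^ 0x63."""
    def rotl(v, n):
        return ((v << n) | (v >> (8 - n))) & 0xFF
    return x ^ rotl(x, 1) ^ rotl(x, 2) ^ rotl(x, 3) ^ rotl(x, 4) ^ 0x63

SBOX = [affine(gf256_inv(x)) for x in range(256)]

# Spot checks against FIPS 197 values.
assert SBOX[0x00] == 0x63
assert SBOX[0x53] == 0xED
```

Exhaustively comparing all 256 outputs of an RTL S-Box against a table like this is cheap and catches basis-change or wiring mistakes early.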

AES-GCM Hardware

Galois/Counter Mode combines AES-CTR encryption with GHASH authentication. The GHASH unit requires a GF(2¹²⁸) multiplier — a second major hardware block alongside AES.

AES-CTR Engine: Counter → AES → XOR → ciphertext
GHASH Unit: GF(2¹²⁸) multiply-accumulate → Auth Tag

Parallelism Options for GHASH:
Bit-serial: 128 cycles/block; minimal area, low throughput
Karatsuba: Recursive decomposition; 3 sub-multiplications instead of 4
4-parallel pipelined: Process 4 blocks/cycle; enables 100+ Gbps authenticated encryption
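The bit-serial option corresponds to the shift-and-XOR multiplication described in NIST SP 800-38D, where each loop iteration would be one clock cycle in hardware. The sketch below is a software model under that reading; blocks are held as 128-bit Python integers with GCM's convention that the leftmost bit is the coefficient of x⁰, and the helper names are illustrative.

```python
R_POLY = 0xE1 << 120   # reduction constant for x^128 + x^7 + x^2 + x + 1 (bit-reflected)
ONE = 1 << 127         # the field element "1" in GCM's bit order

def gf128_mul_bitserial(x, y):
    """Multiply two GHASH blocks, consuming one bit of x per iteration."""
    z, v = 0, y
    for i in range(128):                 # 128 iterations, i.e. about 128 cycles/block
        if (x >> (127 - i)) & 1:         # scan x from its leftmost bit
            z ^= v
        carry = v & 1
        v >>= 1                          # multiply v by x
        if carry:
            v ^= R_POLY                  # reduce modulo the field polynomial
    return z

def ghash(h, blocks):
    """Multiply-accumulate over a list of 128-bit blocks with hash subkey h."""
    y = 0
    for block in blocks:
        y = gf128_mul_bitserial(y ^ block, h)
    return y

a = 0x0123456789ABCDEF0123456789ABCDEF
b = 0xFEDCBA9876543210FEDCBA9876543210
assert gf128_mul_bitserial(a, ONE) == a
assert gf128_mul_bitserial(a, b) == gf128_mul_bitserial(b, a)
```

A parallel or Karatsuba multiplier computes the same function; a model like this serves as the reference when the faster datapath is verified.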

SHA-3 (Keccak) Hardware

Keccak operates on a 1600-bit state (5×5×64) through 24 rounds of θ, ρ, π, χ, ι permutations. Its structure maps naturally to hardware — no S-Box lookup tables needed.

36.4 Gbps on Virtex-7: unrolled architecture with optimized RC generator
~7k Slices (compact): single-round iterative, for IoT/embedded
24→6 cycles via unrolling: a 4× unroll computes 4 rounds/cycle, so one Keccak-f permutation takes 6 cycles

Why Keccak matters for PQC: ML-KEM (Kyber) and ML-DSA (Dilithium) use SHA-3/SHAKE extensively for hashing, sampling, and key derivation. A high-performance Keccak core is essential infrastructure in any post-quantum accelerator.
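To make the sampling role concrete, the sketch below rejection-samples polynomial coefficients from a SHAKE-128 stream using the 12-bit parsing familiar from Kyber-style samplers. It is a simplified illustration: the seed format is an assumption (the standard derives the XOF input from a public seed plus matrix indices), and `hashlib` re-squeezes a longer digest instead of streaming as a hardware core would.

```python
import hashlib

Q = 3329          # ML-KEM modulus
RATE = 168        # SHAKE-128 rate in bytes (one Keccak-f call per block squeezed)

def sample_uniform_poly(seed, n=256):
    """Return n coefficients in [0, Q), rejecting 12-bit candidates >= Q."""
    out_len = 3 * RATE
    while True:
        stream = hashlib.shake_128(seed).digest(out_len)
        coeffs = []
        for i in range(0, len(stream) - 2, 3):
            b0, b1, b2 = stream[i], stream[i + 1], stream[i + 2]
            d1 = b0 | ((b1 & 0x0F) << 8)      # low 12 bits of the 3-byte group
            d2 = (b1 >> 4) | (b2 << 4)        # high 12 bits of the 3-byte group
            for d in (d1, d2):
                if d < Q and len(coeffs) < n:
                    coeffs.append(d)
        if len(coeffs) == n:
            return coeffs
        out_len += RATE                        # rare: squeeze one more block and retry

poly = sample_uniform_poly(b"example seed (hypothetical)")
print(len(poly), max(poly) < Q)
```

About 81% of candidates are accepted (3329/4096), which is why the number of Keccak permutations per polynomial varies and the sampler needs a rejection loop rather than a fixed schedule.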
SECTION II

Public-Key Crypto Engines

RSA modular exponentiation · ECC scalar multiplication

RSA Hardware Architecture

RSA's core operation is modular exponentiation: C = M^e mod n. With square-and-multiply, a full-length private exponent takes thousands of modular multiplications (about 1.5 per exponent bit) over 2048–4096 bit operands.

Montgomery Multiplication — The Workhorse

Replaces expensive trial division with shift-and-add operations in "Montgomery domain." All modular multiplications become: MonPro(a,b) = a·b·R⁻¹ mod n
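A minimal software model of MonPro, assuming R = 2^k with k equal to the bit length of n. The function names are illustrative, and the exponentiation loop is a plain left-to-right square-and-multiply, which is not constant-time and is shown only to exercise the multiplier.

```python
def montgomery_setup(n, k):
    """Return (R, n_prime) with R = 2^k > n and n * n_prime = -1 (mod R); n must be odd."""
    assert n % 2 == 1 and (1 << k) > n
    r = 1 << k
    n_prime = (-pow(n, -1, r)) % r
    return r, n_prime

def mon_pro(a, b, n, k, n_prime):
    """MonPro(a, b) = a*b*R^-1 mod n, using multiplies, masks and shifts only."""
    mask = (1 << k) - 1
    t = a * b
    m = ((t & mask) * n_prime) & mask     # makes t + m*n divisible by R
    u = (t + m * n) >> k                  # exact division by R
    return u - n if u >= n else u         # single conditional subtraction

def mod_exp(base, exp, n):
    """base^exp mod n with every multiplication done in the Montgomery domain."""
    k = n.bit_length()
    r, n_prime = montgomery_setup(n, k)
    x = (base % n) * r % n                # enter the Montgomery domain
    acc = r % n                           # Montgomery form of 1
    for bit in bin(exp)[2:]:
        acc = mon_pro(acc, acc, n, k, n_prime)
        if bit == "1":
            acc = mon_pro(acc, x, n, k, n_prime)
    return mon_pro(acc, 1, n, k, n_prime)  # leave the Montgomery domain

assert mod_exp(65, 17, 3233) == pow(65, 17, 3233)
```

The bit-serial, word-serial and systolic variants all compute this same function; they differ in how many bits of `m` and `t` are processed per cycle.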

Bit-Serial

1 bit/cycle
2048 cycles
Minimal area

Word-Serial

w bits/cycle
2048/w cycles
DSP-friendly

Systolic Array

Pipelined PEs
Continuous flow
High throughput

Full-Parallel

Single cycle
Massive area
Maximum speed

FPGA Advantage: Modern FPGAs contain embedded DSP blocks (e.g., DSP48E1 on Xilinx) and block RAM — ideal for implementing word-serial Montgomery multipliers. A single DSP block + one BRAM can process 2048-bit RSA encryption.

ECC Scalar Multiplication

ECC computes Q = k·P via repeated point addition and doubling over a 256-bit prime field. The hierarchy of operations creates a natural hardware decomposition.

Scalar Multiplication (k·P)
  Point Addition · Point Doubling
    Mod Multiply · Mod Add/Sub · Mod Square · Mod Inversion

Projective coordinates eliminate per-point inversion. Only one final inversion is needed, which is a massive speedup.

Edwards25519 hardware: unified point add/double in 646 cycles, full scalar multiplication in 164,730 cycles at 117 MHz → 1.4 ms per operation on Virtex-5.
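The benefit of projective coordinates can be seen in a small model. The sketch below uses Jacobian coordinates on a toy short-Weierstrass curve over F₉₇ (an assumption chosen for readability, not Edwards25519 and not secure) and performs a single field inversion at the very end. The double-and-add loop is not constant-time.

```python
P, A, B = 97, 2, 3        # toy curve y^2 = x^3 + 2x + 3 over F_97 (illustrative only)
G = (3, 6)                # on the curve: 27 + 6 + 3 = 36 = 6^2
INF = (1, 1, 0)           # Jacobian point at infinity (Z = 0)

def jac_double(pt):
    X, Y, Z = pt
    if Z == 0 or Y == 0:
        return INF
    S = 4 * X * Y * Y % P
    M = (3 * X * X + A * pow(Z, 4, P)) % P
    X3 = (M * M - 2 * S) % P
    Y3 = (M * (S - X3) - 8 * pow(Y, 4, P)) % P
    return (X3, Y3, 2 * Y * Z % P)

def jac_add(p1, p2):
    X1, Y1, Z1 = p1
    X2, Y2, Z2 = p2
    if Z1 == 0:
        return p2
    if Z2 == 0:
        return p1
    U1, U2 = X1 * Z2 * Z2 % P, X2 * Z1 * Z1 % P
    S1, S2 = Y1 * pow(Z2, 3, P) % P, Y2 * pow(Z1, 3, P) % P
    H, R = (U2 - U1) % P, (S2 - S1) % P
    if H == 0:                               # same x: either doubling or P + (-P)
        return jac_double(p1) if R == 0 else INF
    H2 = H * H % P
    H3 = H2 * H % P
    X3 = (R * R - H3 - 2 * U1 * H2) % P
    Y3 = (R * (U1 * H2 - X3) - S1 * H3) % P
    return (X3, Y3, H * Z1 * Z2 % P)

def scalar_mul(k, point):
    """Left-to-right double-and-add; returns an affine point or None for infinity."""
    acc, base = INF, (point[0], point[1], 1)
    for bit in bin(k)[2:]:
        acc = jac_double(acc)
        if bit == "1":
            acc = jac_add(acc, base)
    X, Y, Z = acc
    if Z == 0:
        return None
    z_inv = pow(Z, -1, P)                    # the only field inversion
    return (X * z_inv ** 2 % P, Y * z_inv ** 3 % P)

assert scalar_mul(2, G) == (80, 10)
```

Every `jac_double` and `jac_add` call uses only modular multiplies, squares and add/sub, which is exactly the set of units in the operation hierarchy above.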

SECTION III

Post-Quantum Hardware

ML-KEM · ML-DSA · FALCON · NTT acceleration

NIST PQC Standards & Hardware

NIST finalized three PQC standards in 2024, two of them lattice-based (ML-KEM and ML-DSA). The lattice schemes share a common computational bottleneck: polynomial multiplication, accelerated via the Number Theoretic Transform (NTT).

Standard | Algorithm | Type | Key Operation | Hardware Bottleneck
FIPS 203 | ML-KEM (Kyber) | KEM | Polynomial mul (mod q=3329) | NTT + SHA-3/SHAKE
FIPS 204 | ML-DSA (Dilithium) | Signature | Polynomial mul (mod q=8380417) | NTT + SHA-3/SHAKE
FIPS 205 | SLH-DSA (SPHINCS+) | Signature | Hash tree traversal | SHA-2/SHAKE throughput
FIPS 206 (draft) | FALCON | Signature | FFT Sampling (floating-pt) | Fast Fourier Sampling + NTT

Unified hardware opportunity: ML-KEM and ML-DSA both require NTT butterfly units and Keccak cores. A unified accelerator supporting both schemes shares ~70% of the datapath — the NTT butterfly, modular reduction, and SHAKE engines are common building blocks.

Number Theoretic Transform (NTT)

NTT is the finite-field equivalent of FFT. It converts O(n²) polynomial multiplication into O(n log n) operations modulo a prime q.

NTT Butterfly: The Fundamental Unit

Each butterfly stage computes:

  Cooley-Tukey (CT):    a' = a + ω·b (mod q)
                         b' = a - ω·b (mod q)

  Gentleman-Sande (GS): a' = a + b     (mod q)
                         b' = (a - b)·ω (mod q)

Requires: one modular multiplication + two modular add/sub per butterfly. Montgomery reduction eliminates the expensive division in modular multiplication.
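A short reference model shows how Cooley-Tukey butterflies compose into a full transform and how the transform turns polynomial multiplication into pointwise products. For simplicity the sketch multiplies in Z_q[x]/(xⁿ − 1) with a complete radix-2 NTT, which is an assumption made for illustration; ML-KEM itself works modulo x²⁵⁶ + 1 with a 7-stage incomplete NTT. It uses the fact that 17 is a primitive 256th root of unity modulo 3329.

```python
import random

Q, N, OMEGA = 3329, 256, 17   # 17 has order 256 modulo 3329

def ntt(a, omega):
    """Recursive radix-2 NTT; each loop iteration is one Cooley-Tukey butterfly."""
    n = len(a)
    if n == 1:
        return list(a)
    w2 = omega * omega % Q
    even, odd = ntt(a[0::2], w2), ntt(a[1::2], w2)
    out, w = [0] * n, 1
    for k in range(n // 2):
        t = w * odd[k] % Q                    # one modular multiplication
        out[k] = (even[k] + t) % Q            # a' = a + w*b
        out[k + n // 2] = (even[k] - t) % Q   # b' = a - w*b
        w = w * omega % Q
    return out

def intt(a_hat, omega):
    """Inverse transform: same butterflies with omega^-1, then scale by n^-1."""
    n_inv = pow(len(a_hat), -1, Q)
    return [x * n_inv % Q for x in ntt(a_hat, pow(omega, -1, Q))]

def cyclic_mul(a, b):
    """O(n log n) product in Z_q[x]/(x^n - 1) via pointwise multiplication."""
    fa, fb = ntt(a, OMEGA), ntt(b, OMEGA)
    return intt([x * y % Q for x, y in zip(fa, fb)], OMEGA)

def schoolbook_cyclic(a, b):
    """O(n^2) reference for the same product."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            c[(i + j) % n] = (c[(i + j) % n] + a[i] * b[j]) % Q
    return c

rng = random.Random(1)
a = [rng.randrange(Q) for _ in range(N)]
b = [rng.randrange(Q) for _ in range(N)]
assert cyclic_mul(a, b) == schoolbook_cyclic(a, b)
```

The recursion performs (n/2)·log₂n butterflies in total, which is the quantity the hardware architectures below trade against area.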

Kyber NTT

n=256, q=3329
7 butterfly stages
12-bit coefficients
224 cycles (pipelined)

Dilithium NTT

n=256, q=8380417
8 butterfly stages
23-bit coefficients
512 cycles (pipelined)

FALCON NTT

n=512/1024, q=12289
9–10 stages
14-bit coefficients
Also needs FFT sampling
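The cycle counts above can be sanity-checked with a back-of-envelope model: total butterflies, (n/2) × stages, divided by the number of butterfly units. The unit counts below are inferred so that the model reproduces the quoted figures; they are assumptions, not values stated by the source designs, and pipeline fill and memory stalls are ignored.

```python
def ntt_cycles(n, stages, butterfly_units):
    """Butterflies per transform divided by parallel units (no fill or stall cycles)."""
    return (n // 2) * stages // butterfly_units

kyber_cycles = ntt_cycles(n=256, stages=7, butterfly_units=4)       # matches the 224 quoted
dilithium_cycles = ntt_cycles(n=256, stages=8, butterfly_units=2)   # matches the 512 quoted
single_bf_kyber = ntt_cycles(n=256, stages=7, butterfly_units=1)    # one-butterfly baseline

print(kyber_cycles, dilithium_cycles, single_bf_kyber)
```

Under this reading, a datapath that offers four 12-bit butterflies to Kyber would offer only two 23-bit butterflies to Dilithium, which is one plausible explanation for the 224 versus 512 split.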

NTT Hardware Architectures

Iterative (In-Place)

Single butterfly unit processes coefficients sequentially from RAM. Reuses one BF across all stages.

Area: Minimal (~800 LUTs)
Cycles: (n/2)·log₂n butterflies (n·log₂n cycles if the RAM delivers one coefficient per cycle)
Use: Compact IoT

Multi-Delay Commutator (MDC)

Streaming pipeline with delay lines between stages. Conflict-free memory access. Data flows continuously.

Area: log₂n BF units
Cycles: ~n (pipelined)
Use: High throughput

Single-Path Delay Feedback (SDF)

Feedback architecture reuses butterfly across radix stages. Pipelined with delay register banks.

Area: Moderate
Cycles: ~n + pipeline depth
Use: Area-efficient pipeline

Radix-4 / Mixed-Radix

Processes 4 coefficients per cycle. Halves the number of stages. Optimal for high-performance designs.

Area: 4× butterfly area
Cycles: n/4 per stage
Use: Maximum throughput

State of the art (2024): Pipelined SDF-NTT with multiplier-less Montgomery reduction achieves 49.6% better area-time product than prior designs. Unified architectures support both Kyber (q=3329) and Dilithium (q=8380417) via configurable butterfly width.

PQC Accelerator SoC Integration

Real deployments integrate NTT engines into RISC-V or ARM SoCs via memory-mapped registers or custom instruction set extensions (ISEs).

RISC-V CPU: Control plane · Rejection sampling · Protocol logic
AXI / Wishbone Interconnect
NTT Engine: 4× Radix-2 BF · Montgomery reduce · Coefficient RAM
Keccak/SHAKE: SHA-3 256/512 · SHAKE-128/256 · Sampling engine
TRNG / AES: Entropy source · AES-256 engine · Key storage
Shared Polynomial Memory: Dual-port SRAM · Conflict-free access pattern · DMA interface

Architecture based on CRYSTALS-Dilithium post-quantum SoC designs for wired-communication critical systems.

SECTION IV

Side-Channel Attacks & Countermeasures

The hardware security dimension

Side-Channel Attack Taxonomy

Algorithmically secure crypto can be broken by observing physical leakage during computation. Hardware designers must defend against all channels simultaneously.

⚡ Power Analysis

SPA — Single trace, visual inspection of operations
DPA — Statistical correlation across many traces
CPA — Correlation with hypothetical power models
HO-DPA — Combines multiple time points to defeat masking

📡 Electromagnetic

SEMA — Simple EM analysis (like SPA)
DEMA — Differential EM analysis
EM probes can isolate individual circuit blocks — more spatially precise than power analysis.

⏱️ Timing

Variable execution time leaks secret-dependent branches. Cache-timing attacks exploit data-dependent memory access patterns (e.g., AES T-table lookups).

⚠️ Fault Injection

Voltage glitching, clock glitching, laser fault injection. Induces computational errors that reveal key bits via differential fault analysis (DFA).

Key insight: Against an unprotected implementation, DPA can extract secret keys from as few as 50–1000 traces using standard equipment costing under $1000. Noise alone is not a defense: it averages out as more traces are collected.
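The attack can be reproduced in simulation. The sketch below assumes a simple leakage model, the Hamming weight of the first-round S-Box output plus Gaussian noise, and recovers one key byte by correlation (CPA). The trace count, noise level and key value are arbitrary illustrative choices; real measurements need alignment and a leakage model fitted to the device.

```python
import random

def gf_mul(a, b):
    """GF(2^8) multiply modulo 0x11B."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        a = ((a << 1) ^ 0x11B) if a & 0x80 else (a << 1)
        b >>= 1
    return r

def build_sbox():
    rot = lambda v, n: ((v << n) | (v >> (8 - n))) & 0xFF
    sbox = []
    for x in range(256):
        inv = next((y for y in range(1, 256) if gf_mul(x, y) == 1), 0)
        sbox.append(inv ^ rot(inv, 1) ^ rot(inv, 2) ^ rot(inv, 3) ^ rot(inv, 4) ^ 0x63)
    return sbox

SBOX = build_sbox()
HW = [bin(v).count("1") for v in range(256)]

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5 if vx and vy else 0.0

def simulate_traces(key_byte, n_traces, noise_sigma, rng):
    """One leakage sample per trace: HW(SBOX[pt ^ key]) + Gaussian noise."""
    pts = [rng.randrange(256) for _ in range(n_traces)]
    leaks = [HW[SBOX[p ^ key_byte]] + rng.gauss(0, noise_sigma) for p in pts]
    return pts, leaks

def cpa_recover(pts, leaks):
    """Return the key guess whose predicted leakage correlates best with the traces."""
    scores = [abs(pearson([HW[SBOX[p ^ g]] for p in pts], leaks)) for g in range(256)]
    return max(range(256), key=scores.__getitem__)

TRUE_KEY = 0x3C
pts, leaks = simulate_traces(TRUE_KEY, n_traces=500, noise_sigma=1.0, rng=random.Random(2024))
recovered = cpa_recover(pts, leaks)
print(hex(recovered))
```

Raising `noise_sigma` lowers the correlation peak, and raising `n_traces` restores it, which is the averaging effect described above.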

Hardware Countermeasures

Technique | Defends Against | Overhead | Effectiveness
Boolean Masking (d shares) | DPA up to order d−1 | ~d² area | Provable in probing model
Threshold Implementation (TI) | 1st/2nd order DPA | 3–5× area | Glitch-resistant; proven
Dual-Rail / WDDL Logic | SPA, DPA | 2× area, 2× power | Constant power per op
Random Delays / Shuffling | SPA, DPA (reduces SNR) | TRNG + control | Raises attack complexity
Constant-Time Design | Timing attacks | ~0% area | Eliminates timing channel
Blinding (Asymmetric) | DPA on RSA/ECC | 1 extra mul | Randomizes intermediate values
EM Shielding Mesh | EM probing | Top metal layers | Physical barrier
Voltage/Clock Monitors | Fault injection | Sensor area | Detects glitch attempts

Masking in practice: Rambus DPA-resistant cores validate with TVLA (Test Vector Leakage Assessment) methodology — no detectable leakage beyond 100 million traces, protecting against first and second-order attacks beyond 1 billion operations.

Threshold Implementation (TI)

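TI splits every sensitive variable into at least d+1 shares that are processed independently (the original construction uses td+1 shares for a degree-t function; d+1-share variants rely on fresh randomness). Even if an attacker observes glitch-induced transient leakage, no single share reveals information.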

TI Properties

Correctness
Combining shares recovers the correct output: y = y₁ ⊕ y₂ ⊕ ... ⊕ y_{d+1}

Non-Completeness
Each share function fᵢ is independent of at least one input share, which prevents leakage

Uniformity
Output sharing is uniformly distributed — no bias that could leak information

// 2-share TI for AES S-Box (simplified concept)
// Each share never sees the complete unmasked value

module ti_sbox (
  input  [7:0] x_share1, x_share2,  // Masked input shares
  input  [7:0] fresh_random,          // Fresh randomness per clock
  output [7:0] y_share1, y_share2    // Masked output shares
);
  // Non-linear layer uses cross-domain terms with fresh masks
  // Linear layers (affine transform) applied to each share independently
  // Area: ~6,000 GE vs ~350 GE unmasked — ~17× overhead
endmodule
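The correctness and non-completeness properties can be checked exhaustively on the smallest non-linear function, a single AND gate. The sketch below uses the classic 3-share first-order sharing, which is a different variant from the 2-share module outlined above, and it does not check uniformity (this particular sharing needs remasking to achieve it).

```python
from itertools import product

def ti_and(x, y):
    """3-share TI of z = x & y; x and y are tuples of three 1-bit shares."""
    x1, x2, x3 = x
    y1, y2, y3 = y
    z1 = (x2 & y2) ^ (x2 & y3) ^ (x3 & y2)   # never reads share 1
    z2 = (x3 & y3) ^ (x1 & y3) ^ (x3 & y1)   # never reads share 2
    z3 = (x1 & y1) ^ (x1 & y2) ^ (x2 & y1)   # never reads share 3
    return z1, z2, z3

def unshare(s):
    return s[0] ^ s[1] ^ s[2]

# Correctness: XOR of the output shares equals the AND of the unshared inputs.
for x in product((0, 1), repeat=3):
    for y in product((0, 1), repeat=3):
        assert unshare(ti_and(x, y)) == unshare(x) & unshare(y)

def ignores_share(out_idx, share_idx):
    """Non-completeness: flipping the omitted input share never changes that output."""
    for x in product((0, 1), repeat=3):
        for y in product((0, 1), repeat=3):
            xf = tuple(b ^ (i == share_idx) for i, b in enumerate(x))
            yf = tuple(b ^ (i == share_idx) for i, b in enumerate(y))
            if ti_and(x, y)[out_idx] != ti_and(xf, yf)[out_idx]:
                return False
    return True

assert all(ignores_share(i, i) for i in range(3))
```

Because each output share omits one input share, a glitch inside any single combinational cone cannot depend on the full unmasked value, which is the intuition behind the glitch resistance claimed for TI.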
SECTION V

Implementation Platforms

ASIC · FPGA · SoC · Instruction Set Extensions

ASIC vs. FPGA Trade-offs

Criterion | ASIC | FPGA
Performance | Maximum frequency; custom gates. 45 nm AES: 53 Gbps | Routing delays limit fmax. Kintex US+: 206 Gbps (multi-core)
Area/Cost | Smallest silicon per function. High NRE ($1M+ for 28 nm tapeout) | ~10–50× more silicon per function. Low NRE, per-unit cost higher
Power | Lowest; no config overhead. 65 nm AES: <50 μW achievable | Higher static + dynamic power. Always-on configuration SRAM
Flexibility | Fixed at fabrication. Bugs require respin | Reconfigurable post-deployment. Algorithm agility for PQC migration
Time-to-Market | 12–24 months | Weeks to months
Side-Channel | Custom logic styles possible. Full control over layout | Limited routing control, but partial bitstream patching helps
Best For | High-volume: smart cards, SoCs, data-center NICs, HSMs | Prototyping, low-volume, crypto-agile systems, cloud FPGAs

Crypto Instruction Set Extensions

Rather than a fully separate accelerator, ISEs add crypto-specific instructions to the CPU pipeline. They offer a middle ground between pure software and dedicated hardware.

x86 / AArch64

AES-NI — Single-round AES in one instruction
PCLMULQDQ — Carryless multiply for GHASH
SHA-NI — SHA-1/SHA-256 rounds
SM3/SM4 — Chinese national algorithms
AArch64: AESE, AESMC, SHA256H

RISC-V Crypto (Zkn/Zks)

aes32esi / aes64es — AES encrypt step
sha256sig0 — SHA-256 sigma
sm4ed — SM4 encrypt/decrypt
clmul — Carryless multiply
Custom: RANTT — NTT for PQC

Hybrid approach for PQC: a RISC-V core paired with the OpenTitan Big Number Accelerator (OTBN). ISEs for polynomial arithmetic and NTT butterfly operations accelerate Kyber/Dilithium while the main CPU handles control flow and rejection sampling. This co-design approach achieves practical performance on a silicon root-of-trust platform.
SECTION VI

Protocol-Level Offload

TLS · IPsec · MACsec · Disk Encryption

Inline Encryption Engines

Modern network SoCs embed crypto engines directly in the data path. The CPU never touches plaintext packets — the engine encrypts/decrypts at wire speed.

Network RX (packet parser) → SA Lookup (SPI → Key/IV) → Crypto Engine (AES-GCM / ChaCha20-Poly1305) → Reassembly (header rewrite, checksum update) → TX
Crypto Engine features: Multi-context pipeline · Anti-replay window · IV generation + auth tag verify

IPsec

Security Association (SA) database in on-chip TCAM. Supports 10K+ concurrent tunnels. ESP encrypt + HMAC at 100+ Gbps.

MACsec (802.1AE)

Layer-2 hop-by-hop encryption. AES-GCM-256 at line rate. Critical for industrial networks and data-center fabric.

TLS Offload

Asymmetric (RSA/ECC handshake) + symmetric (AES-GCM record layer). SmartNIC designs offload both from host CPU.

SECTION VII

RTL Design Examples

SystemVerilog implementations of key building blocks

SystemVerilog: AES Round

module aes_round (
  input  logic [127:0] state_in,
  input  logic [127:0] round_key,
  input  logic         is_last_round,  // Skip MixColumns on round 10
  output logic [127:0] state_out
);
  logic [127:0] after_sub, after_shift, after_mix;

  // SubBytes: 16 parallel S-Box instances (composite field)
  genvar i;
  generate
    for (i = 0; i < 16; i++) begin : gen_sbox
      aes_sbox_cf u_sbox (
        .in  (state_in[8*i +: 8]),
        .out (after_sub[8*i +: 8])
      );
    end
  endgenerate

  // ShiftRows: fixed byte permutation (zero-cost in hardware — just wiring)
  assign after_shift = {
    after_sub[127:120], after_sub[ 87: 80], after_sub[ 47: 40], after_sub[  7:  0],
    after_sub[ 95: 88], after_sub[ 55: 48], after_sub[ 15:  8], after_sub[103: 96],
    after_sub[ 63: 56], after_sub[ 23: 16], after_sub[111:104], after_sub[ 71: 64],
    after_sub[ 31: 24], after_sub[119:112], after_sub[ 79: 72], after_sub[ 39: 32]
  };

  // MixColumns: GF(2^8) matrix multiply per column
  generate
    for (i = 0; i < 4; i++) begin : gen_mixcol
      aes_mixcolumn u_mix (
        .in  (after_shift[32*i +: 32]),
        .out (after_mix[32*i +: 32])
      );
    end
  endgenerate

  // AddRoundKey: 128-bit XOR
  assign state_out = (is_last_round ? after_shift : after_mix) ^ round_key;
endmodule

SystemVerilog: Composite Field S-Box

// AES S-Box via tower field: GF(2^8) → GF((2^4)^2) → GF(((2^2)^2)^2)
// No ROM — pure combinational logic (~300 gates)
module aes_sbox_cf (
  input  logic [7:0] in,
  output logic [7:0] out
);
  logic [7:0] mapped, inv_mapped;
  logic [3:0] hi, lo, inv_hi, inv_lo;
  logic [3:0] sum_hl, sq_scale, prod, inv_d, t;

  // Step 1: Isomorphic mapping GF(2^8) → GF((2^4)^2)
  assign mapped = iso_map(in);     // 8×8 binary matrix multiply
  assign hi = mapped[7:4];
  assign lo = mapped[3:0];

  // Step 2: Compute inverse in GF((2^4)^2)
  assign sum_hl   = hi ^ lo;                        // GF(2^4) add
  assign sq_scale = gf16_sq_scale(hi);               // hi² · λ (combined)
  assign prod     = gf16_mul(sum_hl, lo);             // (hi⊕lo) · lo
  assign t        = sq_scale ^ prod;                  // Norm: hi²λ ⊕ (hi⊕lo)·lo
  assign inv_d    = gf16_inv(t);                      // Inversion in GF(2^4)
  // For hi·x + lo modulo x² + x + λ, the inverse is (hi·t⁻¹)·x + (hi⊕lo)·t⁻¹
  assign inv_hi   = gf16_mul(hi, inv_d);              // hi · inv_d
  assign inv_lo   = gf16_mul(sum_hl, inv_d);          // (hi⊕lo) · inv_d

  // Step 3: Inverse map + affine transformation
  assign inv_mapped = inv_iso_map({inv_hi, inv_lo});
  assign out        = affine_transform(inv_mapped);
endmodule

Each gf16_* function decomposes further into GF(2²) arithmetic — ultimately reducing to AND/XOR gates. The deepest path is ~22 XOR + ~12 AND gate delays.

SystemVerilog: NTT Butterfly

// Cooley-Tukey NTT butterfly for ML-KEM (Kyber)
// q = 3329, coefficients are 12-bit
module ntt_butterfly #(
  parameter Q     = 3329,
  parameter WIDTH = 12
)(
  input  logic                clk, rst_n,
  input  logic [WIDTH-1:0]    a_in, b_in,     // Input coefficients
  input  logic [WIDTH-1:0]    omega,           // Twiddle factor (root of unity)
  output logic [WIDTH-1:0]    a_out, b_out     // Output coefficients
);
  logic [2*WIDTH-1:0] prod;
  logic [WIDTH-1:0]   omega_b, sum_val, diff_val;

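  // Modular multiplication: ω · b (mod q) via Montgomery.
  // The reduction returns prod · R^(-1) mod q, so omega is assumed to be
  // stored pre-scaled as ω·R mod q (R = 2^16).
  assign prod = omega * b_in;                   // Full-width multiply

  // Montgomery reduction: t = prod · q_inv mod R; result = (prod − t·q) / R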
  montgomery_reduce #(.Q(Q), .W(WIDTH)) u_mont (
    .product (prod),
    .result  (omega_b)
  );

  // Butterfly: a' = a + ω·b (mod q),  b' = a - ω·b (mod q)
  mod_add_sub #(.Q(Q), .W(WIDTH)) u_add (
    .a(a_in), .b(omega_b), .sum(sum_val), .diff(diff_val)
  );

  // Pipeline register
  always_ff @(posedge clk or negedge rst_n)
    if (!rst_n) begin a_out <= '0; b_out <= '0; end
    else        begin a_out <= sum_val; b_out <= diff_val; end
endmodule

SystemVerilog: Montgomery Reduction

// Montgomery reduction for Kyber (q = 3329) with R = 2^16: result = product · R^(-1) mod q
// Both constant multiplies below can be built from shifts and adds,
// since q = 13·256 + 1 = 2^11 + 2^10 + 2^8 + 1
module montgomery_reduce #(
  parameter int Q     = 3329,
  parameter int Q_INV = 62209,  // q^(-1) mod 2^16
  parameter int W     = 12
)(
  input  logic [2*W-1:0]  product,   // 24-bit input (a × b), must be < q · 2^16
  output logic [W-1:0]    result     // Reduced to [0, q)
);
  logic        [15:0] t;
  logic signed [31:0] u;
  logic signed [W:0]  r;

  // Step 1: t = (product mod R) × Q_INV mod R, where R = 2^16
  assign t = 16'(product[15:0] * 16'(Q_INV));  // Low 16 bits only

  // Step 2: u = product − t × Q. Because t·Q ≡ product (mod R),
  //   the low 16 bits of u are zero and the division by R is exact.
  assign u = $signed(32'(product)) - $signed(32'(t * Q));

  // Step 3: r = u / R lies in (−q, q); add q back when it is negative
  assign r = (W+1)'(u >>> 16);
  assign result = W'((r < 0) ? (r + Q) : r);
endmodule

Key optimization: For Kyber's q=3329, the constant multiplies by Q_INV and by q can be decomposed into shifts and adds, so the reduction step needs no hardware multiplier. The twiddle product ω·b is still a general multiplication unless it is decomposed separately.
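The decomposition can be checked in software before committing it to RTL. The sketch below uses q = 2¹¹ + 2¹⁰ + 2⁸ + 1 and 62209 ≡ −(2¹¹ + 2¹⁰ + 2⁸ − 1) (mod 2¹⁶), with the signed form of Montgomery reduction (subtract t·q, then correct a negative result). The function names are illustrative, and the exact adder structure a synthesis tool or designer would pick is not specified by the source.

```python
Q = 3329
R_BITS = 16
MASK = (1 << R_BITS) - 1
Q_INV = 62209                      # q^-1 mod 2^16

def mul_q_shift_add(t):
    """t * 3329 as three shifts and three adds."""
    return (t << 11) + (t << 10) + (t << 8) + t

def mul_qinv_shift_add(p):
    """(p * 62209) mod 2^16 without a multiplier: negate p * 3327 modulo 2^16."""
    return (-((p << 11) + (p << 10) + (p << 8) - p)) & MASK

def montgomery_reduce(product):
    """Return product * R^-1 mod q for 0 <= product < q * 2^16."""
    t = mul_qinv_shift_add(product & MASK)
    u = (product - mul_q_shift_add(t)) >> R_BITS    # exact: the low 16 bits cancel
    return u + Q if u < 0 else u                    # map (-q, q) into [0, q)

# Cross-check against ordinary multiplication and modular inversion.
r_inv = pow(1 << R_BITS, -1, Q)
for p in range(0, Q * Q, 977):
    assert mul_qinv_shift_add(p & MASK) == ((p & MASK) * Q_INV) & MASK
    assert montgomery_reduce(p) == p * r_inv % Q
```

Only the two constant multiplications are replaced here; the product passed to `montgomery_reduce` still comes from a real multiplier in the butterfly.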

SECTION VIII

Performance Benchmarks

Area · Throughput · Power · Comparisons

AES Implementation Comparison

Implementation | Platform | Throughput | Area | Efficiency
OpenSSL (SW, Xeon) | x86 + AES-NI | ~6 Gbps | n/a | CPU-bound
8-bit iterative | 90 nm ASIC | 94 Mbps | 2,645 GE | 35.6 Kbps/GE
Composite field 128b | 45 nm ASIC | 53 Gbps | ~100K GE | 530 Kbps/GE
Fully pipelined | Artix-7 FPGA | 72 Gbps | ~4K Slices | 18 Mbps/Slice
Multi-core (x4) | KU+ FPGA | 742 Gbps | ~16K Slices | 46 Mbps/Slice
TI masked (2-share) | 180 nm ASIC | ~1 Gbps | ~40K GE | DPA resistant
AES-GCM SoC | Virtex-5 FPGA | 25.6 Gbps | 69% of SoC | 4-parallel engine

Scale insight: From a 2,645-gate IoT core at 94 Mbps to a 742 Gbps multi-core FPGA engine, AES hardware spans a roughly 7,900× throughput range depending on area budget and target application.

Post-Quantum Hardware Benchmarks

Design | Platform | Algorithm | NTT Cycles | Area (LUTs) | ATP Improvement
SDF-NTT (pipelined) | Artix-7 | ML-KEM | 224 | ~4K | 49.6% vs. SoA
Unified NTT (4×BF) | Artix-7 | ML-KEM + ML-DSA | 224 / 512 | ~6K | 82% (Kyber)
MDC-NTT | Kintex-7 | ML-KEM | 5,551 total | ~12K | 38.9% latency↓
FalconSign | 28 nm ASIC | FALCON-512 | n/a | 0.71 mm² | 5.1× vs. SoA
RISC-V + OTBN ISE | OpenTitan | Dilithium verify | n/a | RoT SoC | PQC on RoT
HLS batch accelerator | Alveo FPGA | ML-KEM + ML-DSA | batch | ~50K | High-volume server

Post-quantum hardware is rapidly maturing. Unified Kyber+Dilithium accelerators sharing NTT/Keccak datapaths will become the standard building block for PQC-ready SoCs.

Crypto Accelerator Design Flow

1. Algorithm Analysis
Identify bottlenecks.
Map to parallelism.
Choose field arithmetic.
2. Architecture
Datapath width.
Pipeline depth.
Memory architecture.
3. RTL Design
SystemVerilog/VHDL.
Parameterized modules.
Formal verification.
4. Side-Channel
Add masking/TI.
Constant-time audit.
TVLA simulation.
5. FPGA Prototype
Synthesis + P&R.
Timing closure.
Functional validation.
6. SCA Evaluation
DPA workstation.
EM probing.
Fault injection test.
7. ASIC Synthesis
Standard cell mapping.
DFT insertion.
Power/area optimization.
8. Certification
FIPS 140-3.
Common Criteria.
EMVCo (payments).
Certification matters: FIPS 140-3 Level 3+ requires physical tamper resistance and environmental failure testing. Common Criteria EAL5+ requires semiformal design descriptions, with formally verified design at the highest level (EAL7). These requirements fundamentally shape hardware architecture decisions.
SECTION IX

Future Directions

PQC migration · FHE acceleration · Crypto agility

Emerging Hardware Challenges

🔄 Crypto Agility

The PQC transition requires hardware that can switch algorithms post-deployment. Reconfigurable engines and ISEs provide agility that fixed ASIC logic cannot. Hybrid classical+PQC modes require parallel datapaths.

🔐 Fully Homomorphic Encryption

FHE demands NTT operations on polynomials with tens of thousands of coefficients (ring dimensions of 2¹⁶ and above) and 100+ bit moduli. FPGA accelerators (FAB) achieve 456× speedup over CPU. Custom ASIC designs target practical bootstrapping times.

🧊 Post-Quantum Signatures

FALCON requires floating-point discrete Gaussian sampling — unusual for crypto hardware. ASIC implementations at 28 nm achieve 5.2K signatures/sec with dedicated FFT sampling engines.

📐 Zero-Knowledge Hardware

ZK-SNARKs/STARKs require massive NTT/MSM (multi-scalar multiplication) computations. GPU and FPGA accelerators race to make real-time ZK proofs practical for blockchain and privacy applications.

The convergence: NTT engines appear in PQC, FHE, and ZK-proof accelerators. A well-designed, configurable NTT core becomes reusable IP across multiple next-generation cryptographic applications.

References & Standards

NIST Standards

FIPS 197 — Advanced Encryption Standard (AES)
FIPS 198 — HMAC
FIPS 202 — SHA-3 (Keccak)
FIPS 203 — ML-KEM (Kyber) — Aug 2024
FIPS 204 — ML-DSA (Dilithium) — Aug 2024
FIPS 205 — SLH-DSA (SPHINCS+) — Aug 2024
FIPS 140-3 — Security Requirements for Crypto Modules
SP 800-186 — Elliptic Curve Recommendations

Key Papers

Canright, "A Very Compact Rijndael S-box" (CHES 2005)
Boyar & Peralta, "New Logic Minimization Techniques" (2010)
Kocher, Jaffe & Jun, "Differential Power Analysis" (CRYPTO 1999)
Koç et al., "Finite Field Arithmetic for Cryptography" (IEEE 2010)
Montgomery, "Modular Multiplication Without Trial Division" (1985)

Hardware References

Mathew et al., "53 Gbps Native GF(2⁴)² Composite-Field AES-Encrypt/Decrypt Accelerator" (JSSC 2011, 45 nm)
Ouyang et al., "FalconSign" (TCHES 2025, 28 nm)
FAB: FPGA Accelerator for Bootstrappable FHE (HPCA 2023)
EMINEM: Mixed-Radix NTT for PQC (ACM TRETS 2025)
OpenTitan: Open-Source Silicon Root of Trust

Side-Channel

Rambus DPA Countermeasures & TVLA Methodology
Nikova, Rechberger & Rijmen, "Threshold Implementations" (2006)
Mangard, Oswald & Popp, "Power Analysis Attacks" (Springer)
ChipWhisperer Open-Source SCA Platform

Summary

Symmetric engines — AES pipeline architectures from 2.6K gates to 742 Gbps, S-Box design via composite field decomposition, AES-GCM authenticated encryption with GF(2¹²⁸) GHASH.

Asymmetric engines — Montgomery multiplication architectures (bit-serial to systolic), ECC scalar multiplication with projective coordinates, FPGA DSP block exploitation.

Post-quantum engines — NTT butterfly architectures (SDF, MDC, mixed-radix), unified Kyber+Dilithium datapaths, Keccak/SHAKE as shared infrastructure, RISC-V SoC integration.

Side-channel defense — Boolean masking, threshold implementation, dual-rail logic, TVLA validation, fault injection countermeasures.

Implementation — ASIC vs. FPGA trade-offs, ISE extensions (AES-NI, RISC-V Zkn), protocol offload engines, certification requirements.

Modern Cryptography Series · Presentation 10 of 10