Presentation 03
Design & Implementation
FIPS 197 · Rijndael · GF(2⁸) Arithmetic · S-box Construction
Key Schedule · Hardware Architectures · Side-Channel Countermeasures
1977 — DES published (FIPS 46), 56-bit key
1997 — NIST calls for AES candidates
1998 — 15 submissions from 12 countries
2000 — Rijndael selected as winner
2001 — FIPS 197 published
2003 — NSA approves for classified data
✓ Clean algebraic structure — easy to analyze
✓ Best combination of security + performance
✓ Efficient on 8-bit µC through 64-bit CPUs
✓ Low memory footprint
✓ Parallelizable — suited to pipelining
✓ Simple key schedule
Designed by Joan Daemen & Vincent Rijmen
Katholieke Universiteit Leuven, Belgium
FIPS 197 (2001) · NIST AES Selection Report · Daemen & Rijmen, "The Design of Rijndael" (Springer, 2002)
Each round applies four transformations to a 4×4 byte State matrix:
Final round omits MixColumns.
Non-linear layer — SubBytes (S-box)
Provides confusion via GF(2⁸) inversion
Linear diffusion layer — ShiftRows + MixColumns
Provides diffusion via byte transposition & MDS mixing
Key mixing layer — AddRoundKey
XOR of state with round key
AES operates on a 4×4 array of bytes arranged in column-major order.
in[0] in[1] in[2] … in[15]
↓ maps to columns ↓
S[r,c] = in[r + 4c]
in[0] in[4] in[8] in[12]
in[1] in[5] in[9] in[13]
in[2] in[6] in[10] in[14]
in[3] in[7] in[11] in[15]
Each column is a 32-bit word — this is critical for efficient 32-bit CPU implementations using T-tables.
The state is transformed round-by-round and written back in column-major order to produce the ciphertext.
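The column-major load/store convention can be sketched directly (a small illustration; `load_state` and `store_state` are hypothetical helper names, not part of FIPS 197):

```python
def load_state(inp):
    """Map a 16-byte input block to the 4x4 State: S[r][c] = inp[r + 4*c]."""
    return [[inp[r + 4 * c] for c in range(4)] for r in range(4)]

def store_state(state):
    """Write the State back to a flat 16-byte list in column-major order."""
    return [state[r][c] for c in range(4) for r in range(4)]

block = list(range(16))             # in[0] .. in[15]
S = load_state(block)
assert S[1][2] == block[1 + 4 * 2]  # byte in[9] sits at row 1, column 2
assert store_state(S) == block      # the round-trip is lossless
```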
State = Plaintext
// Initial round key addition
AddRoundKey(State, Key[0])
// Main rounds 1 … Nr−1
for r = 1 to Nr − 1:
    SubBytes(State)
    ShiftRows(State)
    MixColumns(State)
    AddRoundKey(State, Key[r])
// Final round (no MixColumns)
SubBytes(State)
ShiftRows(State)
AddRoundKey(State, Key[Nr])
Ciphertext = State
| Key Length | Rounds (Nr) | Round Keys |
|---|---|---|
| 128 bits | 10 | 11 |
| 192 bits | 12 | 13 |
| 256 bits | 14 | 15 |
All AES byte operations live in the finite field GF(2⁸) defined by the irreducible polynomial:
m(x) = x⁸ + x⁴ + x³ + x + 1 (0x11B)
Coefficient-wise XOR
a(x) ⊕ b(x)
No carries — each bit is independent.
Polynomial multiply then reduce mod m(x)
a(x) · b(x) mod m(x)
Efficiently implemented via xtime() — left-shift + conditional XOR with 0x1B.
def xtime(a):
    """Multiply by x in GF(2^8)"""
    result = a << 1
    if result & 0x100:   # overflow?
        result ^= 0x11B  # reduce mod m(x)
    return result & 0xFF
Every non-zero element has a unique inverse: a⁻¹ such that a · a⁻¹ = 1
By Fermat's little theorem: a⁻¹ = a²⁵⁴ in GF(2⁸)
Or use the Extended Euclidean Algorithm.
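Both routes can be checked numerically. A self-contained sketch using square-and-multiply for a²⁵⁴ (`gf_inv` is an illustrative helper name):

```python
def xtime(a):
    """Multiply by x in GF(2^8), reducing mod m(x) = x^8 + x^4 + x^3 + x + 1."""
    a <<= 1
    return (a ^ 0x11B) & 0xFF if a & 0x100 else a

def gf_mul(a, b):
    """Multiply two GF(2^8) elements via shift-and-add over xtime."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        a = xtime(a)
        b >>= 1
    return p

def gf_inv(a):
    """a^254 by square-and-multiply (0 maps to 0, matching the S-box convention)."""
    r = 1
    for bit in (1, 1, 1, 1, 1, 1, 1, 0):  # bits of 254, MSB first
        r = gf_mul(r, r)
        if bit:
            r = gf_mul(r, a)
    return r

assert gf_mul(0x57, 0x83) == 0xC1   # the worked product from FIPS 197 §4.2
assert gf_inv(0x53) == 0xCA         # matches the S-box example later
```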
FIPS 197 §4.2 · Presentation 01 covers GF(2⁸) in full depth
The only non-linear operation in AES. Each byte is independently transformed.
Step 1 Multiplicative inverse in GF(2⁸)
b = a⁻¹ (with 0⁻¹ defined as 0)
Provides non-linearity — max algebraic degree.
Step 2 Affine transformation over GF(2)
bᵢ′ = bᵢ ⊕ bᵢ₊₄ ⊕ bᵢ₊₅ ⊕ bᵢ₊₆ ⊕ bᵢ₊₇ ⊕ cᵢ (indices taken mod 8)
where c = 0x63 = 01100011₂
The affine step breaks the pure algebraic structure of the inverse, preventing interpolation attacks.
Non-linearity: 112 (best known for an 8-bit bijection)
Algebraic degree: 7
Differential uniformity: 4
Fixed points: 0 (no byte maps to itself)
No opposite fixed points
S(0x00) = 0x63 S(0x01) = 0x7C
S(0x53) = 0xED S(0xFF) = 0x16
Input 0x53: inverse in GF(2⁸) = 0xCA, then affine → 0xED
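The two-step construction can be verified bit by bit. A sketch (the inverse is found by brute-force search here, which is fine for a one-off table build; `sbox` is an illustrative helper name):

```python
def gf_mul(a, b):
    """GF(2^8) multiply, reducing mod 0x11B."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return p

def gf_inv(a):
    """Brute-force inverse with 0 mapped to 0."""
    return next((b for b in range(1, 256) if gf_mul(a, b) == 1), 0)

def sbox(a):
    """S(a): invert in GF(2^8), then apply the affine transform with c = 0x63."""
    b, c, out = gf_inv(a), 0x63, 0
    for i in range(8):
        bit = ((b >> i) ^ (b >> ((i + 4) % 8)) ^ (b >> ((i + 5) % 8))
               ^ (b >> ((i + 6) % 8)) ^ (b >> ((i + 7) % 8)) ^ (c >> i)) & 1
        out |= bit << i
    return out

assert sbox(0x00) == 0x63 and sbox(0x53) == 0xED and sbox(0xFF) == 0x16
```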
| | .0 | .1 | .2 | .3 | .4 | .5 | .6 | .7 | .8 | .9 | .a | .b | .c | .d | .e | .f |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0. | 63 | 7c | 77 | 7b | f2 | 6b | 6f | c5 | 30 | 01 | 67 | 2b | fe | d7 | ab | 76 |
| 1. | ca | 82 | c9 | 7d | fa | 59 | 47 | f0 | ad | d4 | a2 | af | 9c | a4 | 72 | c0 |
| 2. | b7 | fd | 93 | 26 | 36 | 3f | f7 | cc | 34 | a5 | e5 | f1 | 71 | d8 | 31 | 15 |
| 3. | 04 | c7 | 23 | c3 | 18 | 96 | 05 | 9a | 07 | 12 | 80 | e2 | eb | 27 | b2 | 75 |
| 4. | 09 | 83 | 2c | 1a | 1b | 6e | 5a | a0 | 52 | 3b | d6 | b3 | 29 | e3 | 2f | 84 |
| 5. | 53 | d1 | 00 | ed | 20 | fc | b1 | 5b | 6a | cb | be | 39 | 4a | 4c | 58 | cf |
| 6. | d0 | ef | aa | fb | 43 | 4d | 33 | 85 | 45 | f9 | 02 | 7f | 50 | 3c | 9f | a8 |
| 7. | 51 | a3 | 40 | 8f | 92 | 9d | 38 | f5 | bc | b6 | da | 21 | 10 | ff | f3 | d2 |
| 8. | cd | 0c | 13 | ec | 5f | 97 | 44 | 17 | c4 | a7 | 7e | 3d | 64 | 5d | 19 | 73 |
| 9. | 60 | 81 | 4f | dc | 22 | 2a | 90 | 88 | 46 | ee | b8 | 14 | de | 5e | 0b | db |
| a. | e0 | 32 | 3a | 0a | 49 | 06 | 24 | 5c | c2 | d3 | ac | 62 | 91 | 95 | e4 | 79 |
| b. | e7 | c8 | 37 | 6d | 8d | d5 | 4e | a9 | 6c | 56 | f4 | ea | 65 | 7a | ae | 08 |
| c. | ba | 78 | 25 | 2e | 1c | a6 | b4 | c6 | e8 | dd | 74 | 1f | 4b | bd | 8b | 8a |
| d. | 70 | 3e | b5 | 66 | 48 | 03 | f6 | 0e | 61 | 35 | 57 | b9 | 86 | c1 | 1d | 9e |
| e. | e1 | f8 | 98 | 11 | 69 | d9 | 8e | 94 | 9b | 1e | 87 | e9 | ce | 55 | 28 | df |
| f. | 8c | a1 | 89 | 0d | bf | e6 | 42 | 68 | 41 | 99 | 2d | 0f | b0 | 54 | bb | 16 |
Read as: S(0xRC) → row R, column C. Example: S(0x53) = row 5, col 3 → ED
A byte-level transposition that shifts each row cyclically to the left.
Before ShiftRows
After ShiftRows
| Row | Shift (bytes left) |
|---|---|
| Row 0 | 0 (no shift) |
| Row 1 | 1 |
| Row 2 | 2 |
| Row 3 | 3 |
Purpose: Prevents each column from being encrypted independently —
ensures bytes from different columns interact in MixColumns.
Each column is treated as a polynomial over GF(2⁸) and multiplied by a fixed MDS polynomial.
| 02 | 03 | 01 | 01 |
| 01 | 02 | 03 | 01 |
| 01 | 01 | 02 | 03 |
| 03 | 01 | 01 | 02 |
| S0 |
| S1 |
| S2 |
| S3 |
Only uses coefficients 01, 02, 03:
×01 = identity
×02 = xtime(a)
×03 = xtime(a) ⊕ a
Branch number = 5 (maximum possible for 4×4)
A difference in 1 input byte propagates to all 4 output bytes.
Any 2-byte input difference → at least 3-byte output difference.
For decryption, use the inverse matrix with coefficients:
0E, 0B, 0D, 09
These are more expensive to compute — a key factor in hardware design.
The polynomial c(x) = {03}x³ + {01}x² + {01}x + {02} is coprime to x⁴ + 1, guaranteeing invertibility.
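Invertibility can be spot-checked by applying the forward matrix and then the inverse matrix to a column (a sketch, not an optimized implementation; `mix_col` is an illustrative helper that multiplies by a circulant matrix given its first row):

```python
def gf_mul(a, b):
    """GF(2^8) multiply, reducing mod 0x11B."""
    p = 0
    while b:
        if b & 1:
            p ^= a
        a = (a << 1) ^ (0x11B if a & 0x80 else 0)
        b >>= 1
    return p

def mix_col(col, row0):
    """Multiply a 4-byte column by the circulant matrix whose first row is row0."""
    return [
        gf_mul(row0[(0 - r) % 4], col[0]) ^ gf_mul(row0[(1 - r) % 4], col[1])
        ^ gf_mul(row0[(2 - r) % 4], col[2]) ^ gf_mul(row0[(3 - r) % 4], col[3])
        for r in range(4)
    ]

col = [0xDB, 0x13, 0x53, 0x45]                  # known MixColumns test column
fwd = mix_col(col, [0x02, 0x03, 0x01, 0x01])    # MixColumns
assert fwd == [0x8E, 0x4D, 0xA1, 0xBC]
assert mix_col(fwd, [0x0E, 0x0B, 0x0D, 0x09]) == col  # InvMixColumns restores it
```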
The simplest transformation: bitwise XOR of the state with the round key.
✓ XOR is its own inverse — same for encrypt/decrypt
✓ Zero-cost in hardware — just wire connections
✓ Introduces the secret key into every round
✓ Without it, the cipher would be a public permutation
Each 128-bit round key is drawn from the expanded key schedule:
Key[r] = W[4r] ‖ W[4r+1] ‖ W[4r+2] ‖ W[4r+3]
where W[] is the expanded key array (44 words for AES-128)
On a 32-bit CPU, AddRoundKey is 4 XOR operations (one per column-word).
state[c] ^= roundkey[c]; // c = 0..3
Expands the 128-bit cipher key into 44 × 32-bit words (11 round keys).
def key_expansion(key):
    W = [0] * 44
    # First 4 words = cipher key
    for i in range(4):
        W[i] = key_word(key, i)
    for i in range(4, 44):
        temp = W[i - 1]
        if i % 4 == 0:
            temp = sub_word(rot_word(temp))
            temp ^= RCON[i // 4 - 1]  # RCON is 0-indexed: Rcon_1 = RCON[0]
        W[i] = W[i - 4] ^ temp
    return W
RotWord — Rotate 4 bytes left by 1 position
[a₀ a₁ a₂ a₃] → [a₁ a₂ a₃ a₀]
SubWord — Apply S-box to each of 4 bytes
[a₀ a₁ a₂ a₃] → [S(a₀) S(a₁) S(a₂) S(a₃)]
Rcon[i] — Round constant
= [xⁱ⁻¹, 00, 00, 00] in GF(2⁸)
01, 02, 04, 08, 10, 20, 40, 80, 1B, 36
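The round-constant sequence is just repeated doubling in GF(2⁸), which is why 0x80 is followed by 0x1B. A quick sketch:

```python
def xtime(a):
    """Multiply by x in GF(2^8), reducing mod 0x11B on overflow."""
    a <<= 1
    return (a ^ 0x11B) & 0xFF if a & 0x100 else a

rcon, r = [], 0x01
for _ in range(10):
    rcon.append(r)
    r = xtime(r)          # next constant = previous * x

assert rcon == [0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36]
```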
Expands 192-bit key into 52 words (13 round keys)
Non-linear function (SubWord + RotWord + Rcon) applied every 6th word:
if i mod 6 == 0: apply g()
Otherwise: simple XOR chain
Expands 256-bit key into 60 words (15 round keys)
Extra SubWord applied at position 4 within each 8-word group:
if i mod 8 == 0: apply g()
if i mod 8 == 4: SubWord only
This extra non-linear step strengthens the key schedule against related-key cryptanalysis.
| Variant | Key Words (Nk) | Expanded Words | Round Keys | g() applied every |
|---|---|---|---|---|
| AES-128 | 4 | 44 | 11 | 4 words |
| AES-192 | 6 | 52 | 13 | 6 words |
| AES-256 | 8 | 60 | 15 | 8 words (+ SubWord at 4) |
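The table above can be folded into one expansion loop parameterized by Nk. A self-contained sketch (the S-box is rebuilt by brute force so the snippet runs on its own, and words are 4-byte tuples; this is an illustration, not the slides' reference code):

```python
def gf_mul(a, b):
    p = 0
    while b:
        if b & 1:
            p ^= a
        a = (a << 1) ^ (0x11B if a & 0x80 else 0)
        b >>= 1
    return p

def _rotl(x, k):
    return ((x << k) | (x >> (8 - k))) & 0xFF

def _sbox_entry(a):
    """Inverse in GF(2^8), then affine: b ^ b<<<1 ^ b<<<2 ^ b<<<3 ^ b<<<4 ^ 0x63."""
    inv = next((b for b in range(1, 256) if gf_mul(a, b) == 1), 0)
    return inv ^ _rotl(inv, 1) ^ _rotl(inv, 2) ^ _rotl(inv, 3) ^ _rotl(inv, 4) ^ 0x63

SBOX = [_sbox_entry(a) for a in range(256)]
RCON = [0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36]

def key_expansion(key_bytes):
    """FIPS 197 key expansion for Nk = 4, 6 or 8 (AES-128/192/256)."""
    nk = len(key_bytes) // 4
    nr = nk + 6
    W = [tuple(key_bytes[4 * i:4 * i + 4]) for i in range(nk)]
    for i in range(nk, 4 * (nr + 1)):
        t = W[i - 1]
        if i % nk == 0:
            t = t[1:] + t[:1]                        # RotWord
            t = tuple(SBOX[b] for b in t)            # SubWord
            t = (t[0] ^ RCON[i // nk - 1],) + t[1:]  # Rcon
        elif nk == 8 and i % nk == 4:
            t = tuple(SBOX[b] for b in t)            # extra SubWord (AES-256)
        W.append(tuple(a ^ b for a, b in zip(W[i - nk], t)))
    return W

W = key_expansion(list(range(16)))       # key 000102...0f (FIPS 197 App. C)
assert len(W) == 44
assert W[4] == (0xD6, 0xAA, 0x74, 0xFD)  # first expanded word for that key
```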
AES decryption applies the inverse of each operation in reverse order.
AddRoundKey(State, Key[Nr])
for r = Nr−1 downto 1:
    InvShiftRows(State)
    InvSubBytes(State)
    AddRoundKey(State, Key[r])
    InvMixColumns(State)
InvShiftRows(State)
InvSubBytes(State)
AddRoundKey(State, Key[0])
Key insight: By reordering InvSubBytes↔InvShiftRows and InvMixColumns↔AddRoundKey, the inverse cipher has the same structure as encryption.
This requires applying InvMixColumns to round keys 1…Nr−1 during key expansion.
| Encrypt | Decrypt | Cost change |
|---|---|---|
| SubBytes | InvSubBytes | Same (separate LUT) |
| ShiftRows | InvShiftRows | Shift right |
| MixColumns ×{02,03} | InvMixColumns ×{09,0B,0D,0E} | ~3× more expensive |
| AddRoundKey | AddRoundKey | XOR = self-inverse |
On 32-bit processors, SubBytes + ShiftRows + MixColumns can be fused into four 256-entry 32-bit lookup tables.
T₀[a] = [ 02·S(a), S(a), S(a), 03·S(a) ]
T₁[a] = [ 03·S(a), 02·S(a), S(a), S(a) ]
T₂[a] = [ S(a), 03·S(a), 02·S(a), S(a) ]
T₃[a] = [ S(a), S(a), 03·S(a), 02·S(a) ]
eⱼ = T₀[a₀,ⱼ] ⊕ T₁[a₁,ⱼ₊₁] ⊕ T₂[a₂,ⱼ₊₂] ⊕ T₃[a₃,ⱼ₊₃] ⊕ Kⱼ (column indices taken mod 4)
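Note from the definitions that T₁–T₃ are byte rotations of T₀, so an implementation can store a single 1 KB table and rotate. A sketch (`build_t_tables` is an illustrative helper; pass in any full 256-entry S-box list):

```python
def xtime(a):
    """Multiply by x in GF(2^8), reducing mod 0x11B on overflow."""
    a <<= 1
    return (a ^ 0x11B) & 0xFF if a & 0x100 else a

def build_t_tables(sbox):
    """T0[a] = (02*S(a), S(a), S(a), 03*S(a)); T1..T3 are byte rotations of T0."""
    T0 = [(xtime(s), s, s, xtime(s) ^ s) for s in sbox]
    T1 = [(e[3], e[0], e[1], e[2]) for e in T0]   # (03*S, 02*S, S, S)
    T2 = [(e[2], e[3], e[0], e[1]) for e in T0]   # (S, 03*S, 02*S, S)
    T3 = [(e[1], e[2], e[3], e[0]) for e in T0]   # (S, S, 03*S, 02*S)
    return T0, T1, T2, T3

sample = [0x63, 0x7C]   # first two S-box entries, just to exercise the builder
T0, T1, T2, T3 = build_t_tables(sample)
assert T0[0] == (0xC6, 0x63, 0x63, 0xA5)   # 02*63 = C6, 03*63 = A5
assert T1[0] == (0xA5, 0xC6, 0x63, 0x63)
```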
| Metric | S-box | T-table |
|---|---|---|
| Tables | 1 × 256 B | 4 × 1 KB |
| Total memory | 256 B | 4 KB |
| Lookups/round | 16 | 16 |
| XORs/round | many | 16 (32-bit) |
| Ops per round | ~160 | ~20 |
⚠ Cache-timing vulnerability
Table access patterns leak key-dependent indices. Bernstein (2005) demonstrated full AES key recovery from timing measurements alone.
Intel/AMD processors include dedicated AES instructions since Westmere (2010).
| Instruction | Operation |
|---|---|
| AESENC | One encryption round |
| AESENCLAST | Final encryption round |
| AESDEC | One decryption round |
| AESDECLAST | Final decryption round |
| AESKEYGENASSIST | Key schedule helper |
| AESIMC | InvMixColumns (for EqInv keys) |
AESENC — fully pipelined: sustained throughput of one round per cycle
With AES-NI, AES-128-CTR achieves ~4 cycles/byte on modern x86.
That's ~1 GB/s (≈8 Gbps) single-core at 4 GHz.
✓ Constant-time execution — immune to cache-timing attacks
✓ No S-box/T-table in memory — no table access patterns
✓ S-box computed in hardware via GF((2⁴)²) composite field
Intel® Advanced Encryption Standard Instructions (AES-NI) White Paper (2010) · ARM Cryptographic Extension (ARMv8-A, 2011)
AES hardware spans a wide design space from compact IoT to high-throughput network processors.
Single round, reused Nr times
Area: ~3,000–5,000 GE
Throughput: ~1–3 Gbps
Latency: 10–14 cycles
Ideal for area-constrained IoT & smart cards
Pipeline registers within a single round
Area: ~8,000–15,000 GE
Throughput: ~5–15 Gbps
Latency: 20–40 cycles
Balances area and speed
All 10/14 rounds instantiated in hardware
Area: ~50,000–170,000 GE
Throughput: ~30–53 Gbps
Latency: 10–14 cycles
Maximum throughput for network & SoC
| Architecture | Platform | Area | Throughput | Efficiency (Gbps/kGE) |
|---|---|---|---|---|
| 8-bit serial | 180nm ASIC | ~3.1 kGE | ~80 Mbps | 0.026 |
| Iterative 128-bit | Virtex-5 FPGA | 1,364 slices | 3.2 Gbps | — |
| Sub-pipelined | Virtex-6 FPGA | 2,100 slices | 12.8 Gbps | — |
| Composite-field (Mathew et al.) | 45nm ASIC | 56 kGE | 53 Gbps | 0.946 |
Mathew et al., "53 Gbps AES" IEEE JSSC 2011 · Chodowiec & Gaj, "Very Compact FPGA AES" (2003)
The S-box is the most expensive block in AES hardware. Two main approaches exist.
Store the 256×8-bit table directly:
✓ Simple — direct memory lookup
✓ FPGA: fits in one 18Kb BRAM
✗ 16 S-boxes needed → 16 BRAMs per round
✗ Larger area on ASIC
✗ Susceptible to power side-channels
Decompose GF(2⁸) into tower field GF((2⁴)²):
Inversion in GF(2⁸) → operations in GF(2⁴)
✓ ~20% fewer gates (Canright, 2005)
✓ Used in Intel AES-NI hardware
✓ Better for ASIC — pure logic
✗ More complex design verification
GF(2⁸) → isomorphic to GF((2⁴)²) → further to GF(((2²)²)²)
Each inversion reduces to multiplications and squarings in progressively smaller subfields, eventually reaching GF(2²) where inversion is just XOR + AND gates.
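The payoff of descending the tower is that inversion becomes trivial in the small field. A toy sketch in GF(2⁴) with m(y) = y⁴ + y + 1 (illustrative only — real composite-field S-boxes use carefully chosen bases and isomorphism maps, which are omitted here):

```python
def gf16_mul(a, b):
    """Multiply in GF(2^4) modulo y^4 + y + 1 (0x13)."""
    p = 0
    for _ in range(4):
        if b & 1:
            p ^= a
        a <<= 1
        if a & 0x10:
            a ^= 0x13
        b >>= 1
    return p

# A complete inversion table for GF(2^4) has only 16 entries --
# small enough to realize as pure combinational logic in hardware.
INV16 = [0] * 16
for a in range(1, 16):
    INV16[a] = next(b for b in range(1, 16) if gf16_mul(a, b) == 1)

assert all(gf16_mul(a, INV16[a]) == 1 for a in range(1, 16))
```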
Canright, "A Very Compact Rijndael S-box" (2005) · Mathew et al., IEEE JSSC (2011)
AES is a block cipher (128-bit). To encrypt arbitrary-length data, use a mode of operation.
Each block encrypted independently
✗ Deterministic — reveals patterns
✗ Never use for real data
Each block XORed with previous ciphertext
✓ Widely deployed (TLS ≤1.2)
✗ Sequential — can't parallelize encryption
✗ Padding oracle attacks
Encrypt counter → XOR with plaintext
✓ Fully parallelizable
✓ Encryption-only hardware needed
✗ No integrity protection
CTR encryption + GHASH authentication
✓ Authenticated encryption (AEAD)
✓ Fully parallelizable
✓ Hardware-friendly — GHASH uses GF(2¹²⁸)
The standard choice for TLS 1.3, IPsec, SSH
XTS — Tweakable, used for disk encryption
CCM — CTR + CBC-MAC, used in Wi-Fi (WPA2/3)
SIV — Nonce-misuse resistant AEAD
GCM-SIV — Nonce-misuse resistant GCM variant
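The counter-mode skeleton underlying CTR, GCM, CCM, and GCM-SIV is only a few lines: encrypt nonce‖counter, XOR with the data. A sketch with a stand-in permutation in place of AES (`toy_block_encrypt` is a placeholder, not a cipher; note the final keystream block is simply truncated, so no padding is needed):

```python
def ctr_encrypt(block_encrypt, nonce, data):
    """CTR mode: ciphertext = plaintext XOR E(nonce || counter).
    Encryption and decryption are the same operation."""
    out = bytearray()
    for ctr in range((len(data) + 15) // 16):
        counter_block = nonce + ctr.to_bytes(8, "big")  # 8-byte nonce || 8-byte counter
        keystream = block_encrypt(counter_block)
        chunk = data[16 * ctr : 16 * ctr + 16]
        out += bytes(p ^ k for p, k in zip(chunk, keystream))  # zip truncates last block
    return bytes(out)

# Toy 16-byte "permutation" standing in for AES -- for this demo only.
def toy_block_encrypt(block):
    return bytes(b ^ 0xA5 for b in block)

msg = b"attack at dawn"                            # 14 bytes: last block truncated
ct = ctr_encrypt(toy_block_encrypt, b"\x00" * 8, msg)
assert ctr_encrypt(toy_block_encrypt, b"\x00" * 8, ct) == msg  # self-inverse
```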
TIMING
T-table cache access patterns reveal key-dependent indices. Bernstein's attack (2005) recovered a complete AES server key from network timing measurements.
POWER / EM
Differential Power Analysis (DPA) — correlate power traces with hypothetical S-box outputs across many encryptions.
Target: first or last round SubBytes, where known plaintext/ciphertext meets key.
FAULT
Differential Fault Analysis (DFA) — inject faults in round 8/9, compare correct/faulty ciphertexts → recover key.
A single byte fault in round 9 can reduce the key search to ~2³².
The S-box is the primary target because:
1. Only non-linear operation → strongest correlation with power
2. Operates byte-by-byte → divide-and-conquer (16 × 2⁸ vs 2¹²⁸)
3. T-table indices directly reveal key bytes
Combined attacks exploit both fault injection and power analysis simultaneously, defeating higher-order masking countermeasures.
Roche et al. (2011), Dassance & Venelli (2012)
Cache attacks — Flush+Reload, Prime+Probe on shared L3 cache can extract AES keys across VMs.
MASKING
Split every sensitive variable into d+1 random shares such that x = x₁ ⊕ x₂ ⊕ … ⊕ x_(d+1)
d-th order masking requires d+1 shares; the attacker must jointly exploit d+1 points per trace.
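First-order Boolean masking is easy to sketch: a secret byte becomes random shares, and only shares are ever handled. Linear operations like AddRoundKey act on a single share; computing the non-linear S-box on shares is the hard part and is not shown (`mask`/`unmask` are illustrative helper names):

```python
import secrets

def mask(x, d=1):
    """Split byte x into d+1 shares with x = s_0 ^ s_1 ^ ... ^ s_d."""
    shares = [secrets.randbelow(256) for _ in range(d)]
    last = x
    for s in shares:
        last ^= s
    shares.append(last)          # final share makes the XOR equal x
    return shares

def unmask(shares):
    out = 0
    for s in shares:
        out ^= s
    return out

x = 0x3A
sh = mask(x, d=2)                # second-order: three shares
assert unmask(sh) == x
k = 0x7F
sh[0] ^= k                       # AddRoundKey touches only one share
assert unmask(sh) == x ^ k
```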
SHUFFLING
Randomize the order of S-box computations within a round (16 S-box calls can be permuted in 16! ways).
BITSLICING
Compute S-box as Boolean circuit — constant time, no table lookups. 32 AES blocks in parallel on 32-bit CPU.
DUAL-RAIL LOGIC
Each bit represented as complementary pair (a, ā). Constant Hamming weight per operation.
THRESHOLD IMPLEMENTATIONS
Hardware secret sharing with provable glitch resistance. Each share processed by independent logic.
REDUNDANCY (DFA defense)
Temporal: encrypt twice, compare
Spatial: TMR (triple modular redundancy)
Inverse check: decrypt ciphertext, compare with plaintext
NOISE INJECTION
Random delays, dummy operations, clock jitter to decorrelate power traces.
SBOX = [
0x63,0x7c,0x77,0x7b,0xf2,0x6b,0x6f,0xc5,0x30,0x01,0x67,0x2b,0xfe,0xd7,0xab,0x76,
0xca,0x82,0xc9,0x7d,0xfa,0x59,0x47,0xf0,0xad,0xd4,0xa2,0xaf,0x9c,0xa4,0x72,0xc0,
# ... full 256-entry table ...
]
RCON = [0x01,0x02,0x04,0x08,0x10,0x20,0x40,0x80,0x1b,0x36]
def xtime(a): return ((a << 1) ^ 0x1B) & 0xFF if a & 0x80 else (a << 1) & 0xFF
def gf_mul(a, b):  # multiply in GF(2^8) via repeated xtime
    p = 0
    for _ in range(8):
        if b & 1: p ^= a
        a = xtime(a); b >>= 1
    return p
def sub_bytes(state): return [SBOX[b] for b in state]
def shift_rows(s): return [s[0],s[5],s[10],s[15], s[4],s[9],s[14],s[3],
s[8],s[13],s[2],s[7], s[12],s[1],s[6],s[11]]
def mix_columns(s):
    r = []
    for c in range(4):
        i = c * 4
        r += [gf_mul(2,s[i]) ^ gf_mul(3,s[i+1]) ^ s[i+2] ^ s[i+3],
              s[i] ^ gf_mul(2,s[i+1]) ^ gf_mul(3,s[i+2]) ^ s[i+3],
              s[i] ^ s[i+1] ^ gf_mul(2,s[i+2]) ^ gf_mul(3,s[i+3]),
              gf_mul(3,s[i]) ^ s[i+1] ^ s[i+2] ^ gf_mul(2,s[i+3])]
    return r
def add_round_key(s, k): return [a ^ b for a, b in zip(s, k)]
def aes_encrypt(plaintext, key):  # both 16-byte lists
    rkeys = key_expansion(key)  # returns list of 11 × 16-byte round keys
    state = add_round_key(plaintext, rkeys[0])
    for r in range(1, 10):
        state = sub_bytes(state)
        state = shift_rows(state)
        state = mix_columns(state)
        state = add_round_key(state, rkeys[r])
    state = sub_bytes(state)
    state = shift_rows(state)
    state = add_round_key(state, rkeys[10])
    return state
module aes_round (
input logic [127:0] state_in,
input logic [127:0] round_key,
input logic last_round, // skip MixColumns if high
output logic [127:0] state_out
);
logic [127:0] after_sub, after_shift, after_mix;
// ---- SubBytes: 16 parallel S-box instances ----
genvar i;
generate
for (i = 0; i < 16; i++) begin : g_sbox
aes_sbox u_sbox (
.in (state_in[8*i +: 8]),
.out (after_sub[8*i +: 8])
);
end
endgenerate
// ---- ShiftRows: wire-level byte permutation ----
assign after_shift = {
after_sub[127:120], after_sub[ 87: 80], after_sub[ 47: 40], after_sub[ 7: 0],
after_sub[ 95: 88], after_sub[ 55: 48], after_sub[ 15: 8], after_sub[103: 96],
after_sub[ 63: 56], after_sub[ 23: 16], after_sub[111:104], after_sub[ 71: 64],
after_sub[ 31: 24], after_sub[119:112], after_sub[ 79: 72], after_sub[ 39: 32]
};
// ---- MixColumns: 4 column mixers ----
generate
for (i = 0; i < 4; i++) begin : g_mix
aes_mix_column u_mix (
.col_in (after_shift[32*i +: 32]),
.col_out (after_mix[32*i +: 32])
);
end
endgenerate
// ---- AddRoundKey ----
assign state_out = (last_round ? after_shift : after_mix) ^ round_key;
endmodule
| Platform | Implementation | Throughput | Latency | Notes |
|---|---|---|---|---|
| 8-bit AVR | Byte-serial, table S-box | ~1.5 Mbps | ~2,700 cycles | IoT / smart cards |
| ARM Cortex-M4 | Bitsliced | ~45 Mbps | ~480 cycles | Embedded |
| x86 (no AES-NI) | T-table, OpenSSL | ~3.2 Gbps | ~160 cycles | Software, 4 GHz |
| x86 (AES-NI) | AESENC pipeline | ~8 Gbps | ~40 cycles | Single-core CTR |
| Virtex-5 FPGA | Iterative 128-bit | ~3.2 Gbps | 11 cycles | 1,364 slices |
| Virtex-7 FPGA | Full pipeline | ~24 Gbps | 11 cycles | ~4,200 slices |
| 45nm ASIC | Composite S-box, full pipeline | ~53 Gbps | 10 cycles | 56 kGE, Mathew et al. |
| CUDA GPU | Parallel ECB/CTR | ~135 Gbps | — | RTX 4090, batch mode |
Biclique attack (Bogdanov et al., 2011)
AES-128: 2^126.1 AES-192: 2^189.7 AES-256: 2^254.4
Negligible improvement over brute force — only a ~2-bit gain.
Related-key attacks (Biryukov & Khovratovich, 2009)
AES-256: 2^99.5 (requires chosen related keys)
Impractical — requires the attacker to obtain encryptions under specially related keys, a setting well-designed protocols never expose.
Square / integral attacks
Effective up to 6–7 rounds. Full AES (10+ rounds) has comfortable security margin.
AES remains unbroken
No practical attack reducing security below the intended level exists as of 2025.
Grover's algorithm reduces brute-force search to 2^(n/2):
AES-128 → ~2⁶⁴ quantum ops
AES-256 → ~2¹²⁸ quantum ops
Recommendation: use AES-256 for quantum-resistant symmetric encryption.
CNSA 2.0 (NSA) mandates AES-256.
TLS 1.3 (AES-128/256-GCM)
IPsec / IKEv2
SSH
WPA3 (AES-CCM, AES-GCMP)
Signal Protocol (AES-256-CBC)
BitLocker (AES-XTS)
FileVault 2 (AES-XTS-128)
LUKS / dm-crypt
Android FBE
Self-Encrypting Drives (TCG Opal)
Intel AES-NI (Westmere+)
ARM Crypto Extensions (v8-A+)
RISC-V Zkn extension
Qualcomm Inline Crypto Engine
Apple Secure Enclave
Daemen & Rijmen's three core principles:
🔬
S-box from GF(2⁸) inverse — optimal non-linearity. MDS matrix for max diffusion. Wide trail strategy provides provable bounds against differential and linear cryptanalysis.
⚡
T-table fusion for 32-bit CPUs. Byte-oriented for 8-bit µC. Wire-only ShiftRows. Simple xtime() for MixColumns. Parallelizable key schedule.
📐
No secret design constants — everything derived from mathematical first principles. Simple algebraic structure enables thorough analysis and prevents suspicion of trapdoors.
Daemen & Rijmen, "The Design of Rijndael" (Springer, 2002) ch. 5–9
Standards
FIPS 197 — Advanced Encryption Standard (AES), NIST, 2001
NIST SP 800-38A — Modes of Operation
NIST SP 800-38D — GCM Recommendation
CNSA 2.0 — NSA Quantum-Resistant Algorithm Suite
Design & Theory
Daemen & Rijmen, "The Design of Rijndael" (Springer, 2002)
Daemen & Rijmen, AES Proposal: Rijndael v2 (1999)
Shannon, "Communication Theory of Secrecy Systems" (1949)
Hardware
Canright, "A Very Compact Rijndael S-box" (2005)
Mathew et al., "53 Gbps AES in 45nm" IEEE JSSC (2011)
Gaj & Chodowiec, "FPGA and ASIC Implementations of AES" in Cryptographic Engineering (2009)
Cryptanalysis & Side-Channels
Bogdanov et al., "Biclique Cryptanalysis of AES" (ASIACRYPT 2011)
Biryukov & Khovratovich, "Related-Key Attacks on AES-256" (CRYPTO 2009)
Bernstein, "Cache-Timing Attacks on AES" (2005)
Kocher et al., "Differential Power Analysis" (CRYPTO 1999)
Part of the Modern Cryptography Presentation Series