PRESENTATION 09
Timing · Power Analysis · EM Emanations · Cache Attacks · Fault Injection · Masking · Constant-Time Code
Modern Cryptography Series
Cryptographic algorithms are mathematically secure — AES-256 resists 2²⁵⁶ brute-force search. But every implementation runs on physical hardware that leaks information through observable side effects.
Attackers don't break the math. They break the physics.
A side channel is any observable physical quantity — time, power, EM radiation, sound, heat — that correlates with secret data processed inside a device.
🔐
Algorithm Security
AES-128: 2¹²⁸ operations
≈ 10²¹ years @ 10 GHz
⚡
Side-Channel Attack
DPA on AES-128: 16 × 256 guesses
≈ seconds with an oscilloscope
| Channel | Observable | Access | Key Techniques | Cost |
|---|---|---|---|---|
| Timing | Execution time | Remote / Local | Cache timing, branch analysis | Low |
| Power | Current draw | Physical probe | SPA, DPA, CPA | Medium |
| Electromagnetic | EM radiation | Near-field probe | SEMA, DEMA, cross-device DL | Medium |
| Acoustic | Sound / vibration | Microphone | Coil whine, keyboard acoustics | Low |
| Cache / μArch | Cache state | Co-located SW | Flush+Reload, Prime+Probe, Spectre | Low |
| Fault Injection | Induced errors | Physical access | Voltage/clock glitch, laser, RowHammer | Medium–High |
Classification: Passive (observe only) vs. Active (fault injection). Invasive (decap chip) vs. Non-invasive (external probes) vs. Semi-invasive (expose die, no contact).
Section I
When execution time reveals your secrets
First described by Paul Kocher (1996). If a cryptographic operation takes different amounts of time depending on the secret key, an attacker can statistically recover the key by measuring execution times.
# VULNERABLE — timing leaks key bits
def modexp_vulnerable(base, exp, mod):
    result = 1
    for i in range(exp.bit_length() - 1, -1, -1):
        result = (result * result) % mod    # always square
        if (exp >> i) & 1:                  # branch on secret!
            result = (result * base) % mod  # multiply only if bit=1
    return result

# Multiply is ~2x slower than square-only
# → attacker times each iteration
# → recovers every bit of the exponent (secret key)
Fix: Always execute both operations — discard unwanted result via constant-time select.
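A branchless variant of the exponentiation above can be sketched in Python. Note the caveat: Python's big-integer operations are not themselves constant-time, so this only illustrates the structure of the fix, not a deployable implementation:

```python
def modexp_always(base, exp, mod):
    """Square-and-multiply-ALWAYS: both operations execute on
    every iteration; a branchless select keeps the right result."""
    result = 1
    for i in range(exp.bit_length() - 1, -1, -1):
        result = (result * result) % mod    # always square
        multiplied = (result * base) % mod  # always multiply
        bit = (exp >> i) & 1
        mask = -bit                         # all-ones if bit=1, zero if bit=0
        # constant-time select: keep `multiplied` only when bit=1
        result = (multiplied & mask) | (result & ~mask)
    return result
```

Both branches execute on every iteration, so the sequence of operations (and their timing) no longer depends on the exponent bits.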
// Standard memcmp — LEAKS position of first difference
int memcmp(const void *a, const void *b, size_t n) {
    const uint8_t *p = a, *q = b;
    for (size_t i = 0; i < n; i++) {
        if (p[i] != q[i])
            return p[i] - q[i];  // exits early!
    }
    return 0;
}
// Timing reveals how many bytes match
// → attacker brute-forces one byte at a time
// → O(256 × n) instead of O(256^n)
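The byte-at-a-time attack can be demonstrated with a simulated timing oracle. Here the "execution time" is simply the number of leading bytes the early-exit comparison matched; the secret and function names are hypothetical, chosen for illustration:

```python
SECRET = bytes([0x13, 0x37, 0xC0])  # hypothetical token the attacker wants

def leaky_compare_time(guess):
    """Simulated timing oracle: the 'running time' of an early-exit
    memcmp equals the number of leading bytes that matched."""
    t = 0
    for g, s in zip(guess, SECRET):
        if g != s:
            break
        t += 1
    return t

def recover_secret(length):
    """Recover the secret one byte at a time:
    256 * length guesses instead of 256 ** length."""
    known = bytearray()
    for i in range(length):
        for candidate in range(256):
            guess = bytes(known) + bytes([candidate]) + bytes(length - i - 1)
            if leaky_compare_time(guess) > i:  # one more byte matched
                known.append(candidate)
                break
    return bytes(known)
```

For a 3-byte secret this needs at most 768 oracle queries instead of 2²⁴ — exactly the collapse from exponential to linear that the comments above describe.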
// Crypto-safe comparison — always same time
int ct_memcmp(const void *a, const void *b, size_t n) {
    const volatile uint8_t *p = a, *q = b;
    uint8_t diff = 0;
    for (size_t i = 0; i < n; i++) {
        diff |= p[i] ^ q[i];  // accumulate diffs
    }
    // No early exit, no branch on secret data
    return diff;  // 0 if equal, nonzero otherwise
}
// Runs in exactly the same time regardless
// of where (or whether) a mismatch occurs
Used in OpenSSL's CRYPTO_memcmp(), libsodium's sodium_memcmp(), and Go's crypto/subtle.ConstantTimeCompare().
Replace every secret-dependent branch with branchless bitwise arithmetic:
// Branch predictor leaks which path taken
if (secret_bit) {
    result = value_a;
} else {
    result = value_b;
}

// Constant-time conditional select
// mask = 0xFFFFFFFF if bit=1, 0x00000000 if bit=0
uint32_t mask = -(uint32_t)secret_bit;
result = (value_a & mask) | (value_b & ~mask);
// Both values computed, correct one selected
// No branch, no timing difference
Use volatile to prevent dead-code elimination of the unselected value.
Section II
Reading secrets from current consumption
In CMOS logic, dynamic power is consumed when transistors switch state. The power drawn by a gate depends on the Hamming weight and Hamming distance of the data it processes.
P_dyn = α · C_L · V_dd² · f, where α = switching activity (data-dependent!), C_L = load capacitance, V_dd = supply voltage, f = clock frequency.
Key insight: α changes with data values. Processing 0xFF → 0x00 (8 bits flip) draws more power than 0xFF → 0xFE (1 bit flip). This creates a measurable correlation between data and power.
Hamming Weight (HW) model: power ∝ number of 1-bits in the value being processed.
Best for: bus loads, register writes where prior state ≈ 0.
Hamming Distance (HD) model: power ∝ number of bit transitions between consecutive values.
Best for: registers, data buses with known prior value.
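Both leakage models reduce to two one-line helpers; this sketch makes the 0xFF → 0x00 versus 0xFF → 0xFE example from above concrete:

```python
def hamming_weight(x):
    """Number of 1-bits — the HW leakage model."""
    return bin(x).count("1")

def hamming_distance(prev, curr):
    """Number of bit flips between consecutive values — the HD model."""
    return hamming_weight(prev ^ curr)

# 0xFF -> 0x00 flips all 8 bits; 0xFF -> 0xFE flips only one,
# so the second transition draws measurably less dynamic power.
```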
SPA extracts secrets by directly inspecting a single (or few) power traces. Different operations produce visually distinct power signatures.
Square-and-multiply RSA: a square operation has a shorter, lower-amplitude power pattern than a multiply. By reading the trace visually:
Each multiply reveals a 1 bit; square-only reveals 0. A single trace can recover a 2048-bit RSA private key.
Simulated RSA exponentiation power trace. S = square, SM = square + multiply. Key bits readable directly from amplitude pattern.
Always perform both square and multiply, regardless of key bit. Identical operation sequence for 0 and 1 — SPA sees no difference.
Introduced by Paul Kocher, Joshua Jaffe, and Benjamin Jun (1999). Uses statistical analysis across many traces to extract keys even when individual traces are noisy.
Target: output of first-round SubBytes, S[p[i] ⊕ k[i]]. For each key byte hypothesis k*, compute HW of S[p[i] ⊕ k*] for all traces. Correlate against measured power at the SubBytes clock cycle. The correct k* yields the highest Pearson correlation. Repeat for all 16 bytes → full 128-bit key in 16 × 256 = 4,096 correlation tests.
Refinement of DPA using Pearson correlation coefficient instead of difference-of-means. More statistically efficient — requires fewer traces. Standard attack methodology today.
Simulated CPA correlation traces for an AES-128 SubBytes attack, showing how the correct key byte hypothesis (0xAB) rises above the wrong guesses as the trace count increases.
Cyan = correct key hypothesis (0xAB). Grey = wrong hypotheses (sample of 20). X-axis = number of traces. Y-axis = |Pearson ρ|.
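The CPA loop described above can be simulated end to end in a few lines of Python. Two illustrative assumptions replace the real measurement chain: a seeded random permutation stands in for the AES S-box, and a Hamming-weight-plus-Gaussian-noise model stands in for the oscilloscope:

```python
import random

rng = random.Random(42)
SBOX = list(range(256))
rng.shuffle(SBOX)       # stand-in for the AES S-box
SECRET_KEY = 0xAB       # key byte the attack will recover

def hw(x):
    return bin(x).count("1")

def capture_trace(pt):
    """Simulated power sample at the SubBytes clock cycle:
    HW of the S-box output plus Gaussian noise."""
    return hw(SBOX[pt ^ SECRET_KEY]) + rng.gauss(0, 1.0)

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5 if vx and vy else 0.0

def cpa_attack(n_traces=2000):
    """Rank all 256 key hypotheses by |Pearson r| between the HW
    prediction and the measured power; the best one is the key."""
    pts = [rng.randrange(256) for _ in range(n_traces)]
    power = [capture_trace(p) for p in pts]
    return max(range(256),
               key=lambda k: abs(pearson([hw(SBOX[p ^ k]) for p in pts],
                                         power)))
```

With a few thousand traces the correct hypothesis correlates at |r| ≈ 0.8 while wrong guesses hover near zero — the same separation the figure above visualizes.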
The most powerful form of profiled side-channel attack (Chari et al., 2002). Given access to an identical template device, an attacker can build a statistical model and then break a target device with as few as a single trace.
On a device you control, run encryptions with known keys. For each key byte value, record many traces and compute the multivariate Gaussian distribution (mean vector μ, covariance matrix Σ) of the power at selected "points of interest."
Capture trace(s) from the target device. Evaluate the log-likelihood of each template. The key hypothesis with maximum likelihood is the recovered key.
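Both phases can be sketched with a deliberately reduced model: a single point of interest and a univariate Gaussian per template (real template attacks use many points of interest and full multivariate Gaussians). Because the toy leakage here is pure Hamming weight, the attack can only identify the HW class of the secret byte — separating values within a class is exactly what the extra points of interest buy you:

```python
import random
import statistics

rng = random.Random(7)

def hw(x):
    return bin(x).count("1")

def leak(value):
    """One point of interest: HW leakage plus Gaussian noise."""
    return hw(value) + rng.gauss(0, 0.8)

# Profiling phase: on the template device, with KNOWN values,
# fit a Gaussian (mean, stdev) per candidate value.
templates = {}
for v in range(256):
    samples = [leak(v) for _ in range(200)]
    templates[v] = statistics.NormalDist(statistics.mean(samples),
                                         statistics.stdev(samples))

def template_attack(secret, n_traces=50):
    """Attack phase: average a few traces from the target and
    pick the template with maximum likelihood."""
    t = statistics.mean(leak(secret) for _ in range(n_traces))
    return max(range(256), key=lambda v: templates[v].pdf(t))
```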
Modern attacks replace the Gaussian template with neural networks (CNNs, MLPs): no distributional assumptions, automatic selection of points of interest, and greater tolerance of trace misalignment and masking countermeasures.
Section III
Secrets broadcast over the air
Current flow in IC wires produces electromagnetic radiation. A near-field EM probe (loop antenna, 50–500 μm) placed near the chip surface can capture signals that correlate with internal data processing.
SEMA = Simple EM Analysis (like SPA). DEMA = Differential EM Analysis (like DPA). CEMA = Correlation EM Analysis (like CPA). Same statistical techniques, different physical observable.
📡 EM Probe → 🔧 LNA → 📊 Oscilloscope → 💻 CPA Software
Near-field H-probe → Low-noise amplifier → High-speed ADC → Statistical analysis
Route the crypto core exclusively within lower metal layers (M1–M4), then surround it with a Current-Domain Signature Attenuation (CDSA) shield in upper metals. The correlated EM signature is suppressed by 100–1000× before reaching the chip surface.
Train a CNN on EM traces from Device A, attack Device B — same architecture, different manufacturing lot. Demonstrated to extract AES-256 keys with practical trace counts, proving device-specific noise is not a reliable defense.
Section IV
Software-only side channels — no oscilloscope required
Modern CPUs use multi-level caches (L1: ~1 ns, L2: ~5 ns, L3: ~15 ns, DRAM: ~60 ns). These timing differences are a side channel when crypto code performs secret-dependent memory accesses.
Standard AES implementations use lookup tables (T-tables, 4 × 1 KB). The index into each table depends on plaintext ⊕ key. An attacker on the same machine measures which cache lines were accessed, revealing which table entries were used, directly leaking key-dependent information.
Shared memory required. Attacker flushes a cache line, waits for victim, then reloads — fast reload = victim accessed it. Granularity: cache line (64 bytes).
No shared memory needed. Attacker fills cache set with own data, waits, then re-accesses — slow access = victim evicted a line. Works across VMs on shared L3.
Evict specific cache sets, time victim's total execution. Variation in time reveals which sets victim accesses.
Exploits speculative execution and branch prediction. The CPU speculatively executes instructions beyond a branch before the condition is resolved. Even though results are rolled back architecturally, cache state changes persist and are observable via timing.
Variant 1 (Bounds Check Bypass): Trick victim into speculatively accessing out-of-bounds memory → data encoded into cache → Flush+Reload to read it.
Variant 2 (Branch Target Injection): Poison branch predictor to redirect speculative execution to attacker-chosen gadget.
Exploits out-of-order execution with delayed exception handling. CPU executes a kernel memory read before the privilege check completes. The data reaches the cache before the fault is raised.
Impact: Read arbitrary kernel memory from userspace. All kernel address space — including every process's passwords, keys, and secrets — accessible to any unprivileged process.
KPTI (unmap kernel pages), retpoline (replace indirect branches), IBRS/IBPB (hardware predictor barriers), DAWG (cache partitioning), microcode patches. Performance cost: 5–30% on some workloads.
Section V
Active attacks — inducing errors to extract secrets
⚡
Briefly drop/spike Vdd to violate setup/hold timing. Causes bit flips in registers or skipped instructions. Cost: < $100 (ChipWhisperer). Requires removal of decoupling capacitors for best results.
Practicality: Very High
🕐
Insert an extra short clock pulse or shorten a cycle. CPU reads register before stable — causes computation errors. Cost: < $200. Temporal precision ~0.17 ns achievable on FPGA-based injectors.
Practicality: Very High
📡
High-current EM pulse from a coil induces local currents in the die. Combines spatial targeting with moderate cost. Does not require board modification. Cost: ~$1–5K.
Practicality: High
💥
IR laser focused on exposed die creates electron-hole pairs → bit flips in SRAM/logic. Highest precision (single-transistor targeting). Requires chip decapsulation. Cost: $30–100K.
🔨
Repeated DRAM row activations cause charge leakage in adjacent cells → bit flips. No physical access needed — purely software. Demonstrated: privilege escalation, TLS key extraction, VM escape. Affects DDR3/DDR4; mitigations in DDR5 (TRR) partially bypassable.
Introduced by Biham & Shamir (1997). Inject a fault during encryption, then compare the correct and faulty ciphertexts. The difference reveals information about the secret key.
Inject a single-byte fault at the input of round 8 MixColumns. This fault propagates through rounds 9 and 10, affecting 4 bytes of ciphertext. Analyzing the difference between correct and faulty outputs narrows four key bytes to a small candidate set; in practice two faulty ciphertexts recover the full AES-128 key.
Bellcore Attack (Boneh et al., 1997): If RSA uses CRT and a fault occurs in one of the two modular exponentiations, the faulty signature s′ is correct mod one prime and wrong mod the other — so gcd(s′ᵉ − m, n) reveals a prime factor of n.
A single faulty signature completely breaks RSA. This is why RSA-CRT implementations must verify signatures before output.
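The whole attack fits in a few lines of Python. The key below uses illustrative textbook-sized primes, and the "fault" is a single XORed bit in the p-half of the CRT computation:

```python
import math

# Toy RSA-CRT key (illustrative small primes, not real parameters)
p, q = 61, 53
n = p * q          # 3233
e = 17
d = 413            # e*d ≡ 1 (mod lcm(p-1, q-1))

def sign_crt(m, inject_fault=False):
    """RSA signing via CRT; optionally flip one bit in the p-half."""
    sp = pow(m, d % (p - 1), p)
    sq = pow(m, d % (q - 1), q)
    if inject_fault:
        sp ^= 1                  # the induced fault
    # Garner recombination: s ≡ sp (mod p), s ≡ sq (mod q)
    h = (pow(q, -1, p) * (sp - sq)) % p
    return (sq + q * h) % n

m = 1234
s_good = sign_crt(m)                    # verifies: s^e ≡ m (mod n)
s_bad = sign_crt(m, inject_fault=True)  # correct mod q, wrong mod p
# Bellcore: one gcd on the faulty signature factors n
factor = math.gcd(pow(s_bad, e, n) - m, n)
```

Since s_bad is still correct mod q, q divides s_bad^e − m while p does not, so the gcd yields q — and with it the full private key. Verifying the signature before releasing it (as the text notes) blocks exactly this.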
• Redundant computation — compute twice, compare results
• Infection — if fault detected, corrupt output completely
• Temporal redundancy — re-execute and compare
• Error-detecting codes — parity/CRC on intermediate values
• Environmental sensors — voltage, clock, temperature monitors
Section VI
Defending implementations against physical leakage
Masking splits every sensitive intermediate value into d+1 random shares. Each share is statistically independent of the secret. An attacker must combine information from all shares simultaneously to learn anything — requiring (d+1)th-order analysis.
Each share x_i (for i > 0) is uniformly random; x_0 = x ⊕ x_1 ⊕ … ⊕ x_d. Any strict subset of shares is uniformly distributed — it reveals zero information about x.
Used for ARX ciphers (SPECK, ChaCha20). Converting between Boolean and arithmetic masking requires dedicated secure algorithms.
# First-order masked SubBytes (d=1)
def masked_subbytes(x0, x1):
    """x = x0 ⊕ x1 (true value)"""
    # Re-mask: generate fresh mask
    r = random_byte()
    # Pre-compute masked S-box table
    # S'[a] = S[a ⊕ x1] ⊕ r (for all a)
    masked_sbox = [sbox[a ^ x1] ^ r
                   for a in range(256)]
    # Apply: lookup with share x0
    y0 = masked_sbox[x0]
    y1 = r  # new mask
    # Now y0 ⊕ y1 = S[x0 ⊕ x1] = S[x]
    return y0, y1
dth-order masking protects against attacks combining up to d intermediate values. Cost grows as O(d²) in computation and O(d) in randomness. Typical: d=1 (first-order) for smart cards, d=2–3 for high-assurance.
Masking in hardware must handle glitches — transient signal transitions that can combine shares within a single clock cycle, leaking information even if the steady-state design is secure.
Correctness: Combining output shares yields the correct function output.
Non-completeness: Each output share depends on at most d input shares (i.e., at least one share is missing from every output computation). This ensures glitches within a single combinational block cannot recombine all shares.
Uniformity: Output shares are uniformly distributed if input shares are. Required for composability of multiple TI stages.
// Inputs: a = (a0, a1, a2), b = (b0, b1, b2)
// where a = a0 ⊕ a1 ⊕ a2, b = b0 ⊕ b1 ⊕ b2
// Output shares of c = a AND b:
assign c0 = (a0 & b0) ^ (a0 & b1) ^ (a1 & b0);
assign c1 = (a1 & b1) ^ (a1 & b2) ^ (a2 & b1);
assign c2 = (a2 & b2) ^ (a2 & b0) ^ (a0 & b2);
// c0 ⊕ c1 ⊕ c2 = a & b ✓
// Each c_i depends on only 2 of 3 input shares ✓
// Non-completeness: c0 has no a2,b2 etc.
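The sharing above can be checked exhaustively in a few lines of Python — a direct translation of the RTL, not production verification tooling:

```python
from itertools import product

def ti_and(a, b):
    """3-share threshold implementation of AND, as in the RTL above.
    Each output share omits one input share (non-completeness)."""
    a0, a1, a2 = a
    b0, b1, b2 = b
    c0 = (a0 & b0) ^ (a0 & b1) ^ (a1 & b0)  # never touches share 2
    c1 = (a1 & b1) ^ (a1 & b2) ^ (a2 & b1)  # never touches share 0
    c2 = (a2 & b2) ^ (a2 & b0) ^ (a0 & b2)  # never touches share 1
    return c0, c1, c2

def correctness_holds():
    """c0^c1^c2 == (a0^a1^a2) & (b0^b1^b2) for all 64 sharings."""
    for bits in product((0, 1), repeat=6):
        a, b = bits[:3], bits[3:]
        c0, c1, c2 = ti_and(a, b)
        if c0 ^ c1 ^ c2 != (a[0] ^ a[1] ^ a[2]) & (b[0] ^ b[1] ^ b[2]):
            return False
    return True
```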
Instead of removing the correlation (masking), hiding makes the leakage harder to exploit by adding noise or randomizing timing.
You write constant-time C. The compiler "optimizes" it back into branching code. This is a fundamental tension — compilers are designed to make code fast, not constant-time.
// You write:
uint32_t mask = -(uint32_t)(a == b);
result = old ^ (mask & (old ^ new_val));
// Compiler may emit:
// CMP a, b
// JNE skip ← branch reintroduced!
// MOV result, new_val
// skip:
Mitigations: volatile to prevent optimization; -O0 or specific compiler barriers.
libsodium / NaCl — Designed constant-time from the ground up. Curve25519 in 32-bit arithmetic, no table lookups, no secret-dependent branches.
OpenSSL (post-2014) — Constant-time BN operations, bitsliced AES, blinded RSA. Legacy codepaths still occasionally vulnerable.
BoringSSL — Google's fork with aggressive constant-time discipline. All crypto ops verified.
AES-NI — Intel hardware AES instructions are inherently constant-time (no table lookups, fixed latency). Eliminates cache-timing attacks on AES entirely.
ct-verif · dudect · timecop · Binsec/Rel · CacheAudit · ct-wasm
TVLA (Goodwill et al., 2011) is a standardized methodology to detect whether a device leaks side-channel information — without performing a full key-recovery attack.
Collect two sets of power traces: Set A with a fixed plaintext, Set B with random plaintexts (the fixed-vs-random test), randomly interleaved during capture.
At each time sample, compute Welch's t-statistic between Set A and Set B.
If |t| > 4.5 at any time point → statistically significant leakage detected (p < 10⁻⁵). The device fails TVLA.
Simulated TVLA t-test. Red horizontal lines = ±4.5 threshold. Cyan trace = t-value over time. Spikes exceeding threshold indicate leakage points.
For masked implementations, apply preprocessing before the t-test: square the centered trace values (second-order), or compute products of pairs of samples. If the second-order t-test shows |t| > 4.5 on a first-order masked design, the masking is broken. Extend to nth-order for nth-order masking evaluation.
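The core of the test is a per-sample Welch's t-statistic over the two trace sets; this sketch assumes traces are plain lists of samples (for second-order TVLA you would square the mean-centered samples before calling it):

```python
import statistics

def welch_t(a, b):
    """Welch's t-statistic between two sample sets at one time point."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    return (ma - mb) / (va / len(a) + vb / len(b)) ** 0.5

def tvla_leaks(fixed, rand, threshold=4.5):
    """fixed / rand: lists of equal-length traces (lists of samples).
    The device fails TVLA if |t| > threshold at ANY time point."""
    for j in range(len(fixed[0])):
        a = [tr[j] for tr in fixed]
        b = [tr[j] for tr in rand]
        if abs(welch_t(a, b)) > threshold:
            return True
    return False
```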
Section VII
Silicon-level defenses for crypto accelerators
Embed the crypto core within a dedicated power regulation block. The signature attenuation hardware (SAH) absorbs the data-dependent current signature internally, presenting only a flat (or noise-dominated) current profile to the external power pin. Demonstrated 100–1000× increase in traces-to-disclosure on AES-256.
Integrated LDO or switched-capacitor regulator with fast transient response. Supply current drawn from external pin is filtered — attacker sees regulator's signature, not crypto engine's.
High-quality TRNG essential for masking, shuffling, blinding. Common: ring-oscillator based, metastable latch, or SRAM PUF-derived entropy. Must pass NIST SP 800-90B health tests.
Voltage glitch detectors (under/overvoltage), clock frequency monitors, temperature sensors, light sensors (detect decapsulation/laser). On anomaly: halt operation, zeroize keys, alert host.
Top metal layers carry a random bit pattern continuously checked by the security controller. Any physical probing or FIB (focused ion beam) edit that cuts a mesh wire triggers immediate key zeroization. Used in EMV chips, HSMs, Apple Secure Enclave.
Keys stored in OTP/eFuse or SRAM PUF — never exposed on bus. Crypto operations execute inside a hardened island with dedicated clock, power, and bus isolation. ARM TrustZone CryptoCell, Intel AES-NI, dedicated secure elements (SE050, ATECC608).
| Implementation | Order | Area (GE) | Throughput | Randomness / enc | Traces to break |
|---|---|---|---|---|---|
| AES-128 (unprotected) | — | ~2,400 | 54 cycles | 0 bits | ~500 |
| AES-128 + 1st-order TI | d=1 | ~7,500 | 266 cycles | ~48 bits | ~50,000 |
| AES-128 + 1st-order DOM | d=1 | ~5,800 | ~108 cycles | ~18 bits/round | ~50,000 |
| AES-128 + 2nd-order DOM | d=2 | ~11,000 | ~216 cycles | ~54 bits/round | ~10⁸ |
| AES-128 + shuffling + mask | d=1 | ~8,200 | ~160 cycles | ~128 bits | ~10⁶ |
| AES-128 + SAH wrapper | — | ~4,000 + SAH | 54 cycles | 0 bits | > 10⁷ |
GE = gate equivalents (1 GE ≈ 2-input NAND). Cycle counts for round-based architectures at 1 round/cycle. Traces-to-break assumes first-order CPA attack on the respective protection level. Values are approximate — vary with technology node, implementation style, and attack sophistication.
Section VIII
When theory meets silicon
Power Analysis — DPA on DES/3DES implementations in banking smart cards recovered PIN encryption keys. Led to mandatory SPA/DPA resistance testing in Common Criteria (EAL4+ with AVA_VAN.5). Modern EMV chips use masking + active shields + environmental sensors.
Cache Timing — Bernstein demonstrated timing-based key extraction from OpenSSL AES over a network. Percival extracted RSA keys via L1 cache contention on hyperthreaded Intel CPUs. Led to constant-time bignum operations and AES-NI adoption.
EM Emanation — The NSA's classified TEMPEST program (later declassified) demonstrated reconstruction of CRT monitor images from EM emanations at a distance. Van Eck published a civilian version in 1985. Modern extensions: recovering LCD content and keyboard keystrokes from EM.
Microarchitectural — Spectre and Meltdown affected virtually all modern CPUs (Intel, AMD, ARM). Spectre exploited branch prediction; Meltdown exploited out-of-order execution. Industry-wide response: OS patches, microcode updates, CPU redesigns. Performance impact: 5–30%.
Fault Injection — Plundervolt used software-controlled undervolting (via the DVFS interface) to induce faults inside Intel SGX enclaves, breaking AES-NI integrity. VoltPillager showed hardware voltage injection bypassed all software mitigations. Led to Intel locking DVFS in SGX mode.
Algorithmic + SC — RSA key generation in Infineon TPMs (the ROCA vulnerability) used a flawed fast-prime algorithm. Side-channel analysis revealed the structure, allowing factorization of 2048-bit RSA keys from the public key alone. Affected millions of Estonian ID cards, YubiKeys, and TPM-protected machines.
No single countermeasure is sufficient. Real-world secure implementations combine multiple layers:
Algorithm Layer
• Algorithms with inherent resistance (e.g., Keccak's sponge has no S-box lookup)
• Avoid RSA-CRT without verification
• Prefer X25519 over NIST P-256 (easier to implement in constant time)
Software Layer
• Constant-time code discipline
• Boolean/arithmetic masking
• Shuffling of independent operations
• Blinding (randomize inputs to foil DPA)
• Verified with ct-verif / dudect
Hardware + Physical Layer
• Threshold implementations in RTL
• On-chip noise injection / SAH
• Environmental sensors & active mesh
• TVLA validation at silicon bring-up
• Common Criteria / FIPS 140-3 certification
Randomize the input to the private-key operation so the attacker cannot correlate known inputs with power/timing traces.
# RSA signature with base blinding (e = public exponent)
def rsa_sign_blinded(m, d, e, n):
    # Generate random blinding factor coprime to n
    r = random_coprime(n)
    r_inv = modular_inverse(r, n)
    # Blind the message: m' = m · r^e mod n
    m_blind = (m * pow(r, e, n)) % n
    # Sign the blinded message — the exponentiation operates
    # on randomized data, decorrelated from m
    s_blind = pow(m_blind, d, n)
    # Unblind: s_blind = m^d · r mod n, so multiply by r⁻¹
    s = (s_blind * r_inv) % n
    return s  # correct signature, leakage decorrelated
Every invocation uses a fresh random r, so the intermediate values during exponentiation are completely randomized. Attacker's traces are decorrelated from the actual key bits.
For elliptic curve scalar multiplication k·G:
Replace k with k + r·n where n is the group order and r is random. Since n·G = O (point at infinity), the result is unchanged: (k + r·n)·G = k·G. But the scalar bits seen by the multiplier are randomized.
Add a random point R to the input, compute k·(G + R), then subtract k·R. Requires precomputation of k·R via a separate (protected) path.
Represent point (X, Y, Z) as (λX, λY, λZ) for random λ. Same projective point, but all intermediate coordinate values are randomized. Near-zero overhead.
Every physical implementation leaks information through time, power, EM, cache state, sound, and fault responses. Mathematical security guarantees do not automatically transfer to real hardware/software.
SPA/DPA/CPA extract keys by observing power consumption. Cache timing attacks (Flush+Reload, Spectre) work purely in software. Template attacks and deep-learning SCA approach single-trace key recovery.
Voltage/clock glitching, EMFI, and laser FI induce faults for DFA. RowHammer is a software-only fault injection. A single faulty RSA-CRT signature breaks the key.
Masking (Boolean, arithmetic, threshold implementations) removes statistical dependence. Hiding (shuffling, dual-rail logic, noise injection) reduces SNR. Constant-time code eliminates timing channels. Blinding randomizes protocol inputs. Hardware sensors detect fault injection.
No single technique is foolproof. Production systems layer masking + hiding + constant-time + blinding + hardware sensors + TVLA testing. Certification standards (Common Criteria, FIPS 140-3, EMVCo) mandate multi-layer SCA resistance.
Deep-learning SCA, cross-device attacks, and microarchitectural exploits continue to advance. Post-quantum algorithms (ML-KEM, ML-DSA) introduce new SCA surfaces. The field remains an active frontier of both attack and defense research.
Foundational Papers
• Kocher, "Timing Attacks on Implementations of Diffie-Hellman, RSA, DSS, and Other Systems" (1996)
• Kocher, Jaffe, Jun, "Differential Power Analysis" (1999)
• Chari, Rao, Rohatgi, "Template Attacks" (CHES 2002)
• Biham & Shamir, "Differential Fault Analysis" (1997)
• Boneh, DeMillo, Lipton, "On the Importance of Checking Crypto Computations" (1997)
• Piret & Quisquater, "A DFA Technique Against SPN Structures" (2003)
• Bernstein, "Cache-Timing Attacks on AES" (2005)
Spectre / Meltdown
• Kocher et al., "Spectre Attacks: Exploiting Speculative Execution" (2018)
• Lipp et al., "Meltdown: Reading Kernel Memory from User Space" (2018)
Countermeasures
• Nikova, Rechberger, Rijmen, "Threshold Implementations Against Side-Channel Attacks and Glitches" (2006)
• Ishai, Sahai, Wagner, "Private Circuits: Securing Hardware Against Probing Attacks" (2003)
• Das & Sen, "EM and Power SCA: Advanced Attacks and Low-Overhead Countermeasures" (2020)
• Schneider, Moradi, Güneysu, "Arithmetic Addition over Boolean Masking" (2015)
• Goodwill et al., "TVLA — A Testing Methodology for SCA Resistance" (2011)
Standards & Specifications
• NIST SP 800-90B (Entropy Sources)
• FIPS 140-3 (Cryptographic Module Validation)
• Common Criteria ISO/IEC 15408 (EAL4+ AVA_VAN.5)
• EMVCo Security Evaluation Standards