Companion to Spectral Audio Signal Processing

An interactive guide to Julius O. Smith III's book on parametric spectrum modelling — sinusoidal tracking, spectral envelopes, pitch, reassignment, the chirp-z transform, and linear prediction. Written to sit above the plain STFT and below neural spectrum modelling.

Julius O. Smith III · McAulay & Quatieri · Serra & Smith · Auger & Flandrin · de Cheveigné & Kawahara

Companion to ccrma.stanford.edu/~jos/sasp/


01 Prologue — Why Spectral Audio

Julius O. Smith's third online book, Spectral Audio Signal Processing (SASP), is not really about the DFT. The DFT is covered in his first book, the STFT in his second. SASP is about everything that follows a spectrum: how to read it, how to parameterise it, how to track its peaks across time, how to model its envelope, how to re-sharpen it after a window has smeared it, and how to synthesise sound from the numbers that come out the other end.

This companion walks through SASP chapter by chapter, with live visualisations of every important construction. Each demo is computed in your browser with a hand-written radix-2 FFT and the Web Audio API — no servers, no dependencies except KaTeX for the mathematics.

What SASP adds beyond the earlier books

Smith's four online books form a tower. Mathematics of the DFT (MDFT) builds the bare transform. Introduction to Digital Filters (Filters) builds the LTI machinery. Physical Audio Signal Processing (Physical) builds strings, tubes, rooms. Spectral Audio Signal Processing (SASP) lives in between — it is the book that answers the question "now that I have a spectrum, what do I do with it?".

Book     | Lives at               | Answers
MDFT     | Signal → transform     | What is a DFT?
Filters  | Transform → LTI design | How do I shape sound?
SASP     | Transform → parameters | What is in this spectrum?
Physical | Parameters → sound     | How do strings and tubes sound?

What you should already know

If you have worked through the STFT for Sound companion, you are ready. This guide does not re-derive windowing, COLA, or the phase-vocoder phase update — those sit in the STFT guide. Here we stand above that layer: we assume you already have a spectrogram and we ask what to do with it.

The hero canvas above is not decoration — it is the key image of this book. Each bright dot is a sinusoid with a frequency (vertical axis), a time (horizontal axis), and an amplitude (brightness). The trails behind the dots are the tracks that the McAulay–Quatieri algorithm will draw in chapter 4. Every major chapter of SASP is an answer to a question about those dots.

Cross-links to siblings

This companion is one of a family of interactive audio-DSP guides. Links that matter for what follows:

STFT_for_Sound Wavelets_for_Sound Constant_Q_Transform_for_Sound MFCC_for_Sound MDCT_for_Sound Sound_Transforms_History DDSP_Differentiable_DSP PhaseVocoder MusicalTimeStretchingPitchShifting Synth_Additive Synth_Spectralist

And the sibling Companion repos on the four JOS books:

Companion_JOS_Mathematics_of_the_DFT Companion_JOS_Introduction_to_Digital_Filters Companion_JOS_Physical_Audio_Signal_Processing Companion_JOS_Audio_Signal_Processing_in_Faust

02 STFT Refresher — What Peaks Look Like

The STFT turns a signal into a grid of complex numbers. For parametric modelling we care about peaks in the log-magnitude of each frame, because peaks are where sinusoids live. This chapter is a refresher with a specifically parametric eye; the mechanics of windowing and hop sizing are covered in full in the STFT_for_Sound companion.

The single-sinusoid picture

When you compute the DFT of a windowed pure tone, you do not see a delta function — you see the window's Fourier transform shifted to the tone's frequency. Every window has a characteristic mainlobe width (which controls the minimum resolvable frequency spacing), and a characteristic sidelobe level (which controls how much a strong sinusoid can mask a weak one nearby).

$$X_w[k] \;\approx\; \tfrac{A}{2}\, W\!\left(\omega_k - \omega_0\right) e^{j\phi_0} \;+\; \tfrac{A}{2}\, W\!\left(\omega_k + \omega_0\right) e^{-j\phi_0}$$

When the two copies of $W$ do not overlap appreciably, the positive-frequency peak is a clean image of $W$. Parametric methods exploit that: they fit a known mainlobe shape to the few bins around the peak and read off the parameters $(A, \omega_0, \phi_0)$ directly.

Interactive: window library with parametric readouts

···
Left: window in time. Right: zero-padded FFT magnitude in dB. Readout reports main-lobe 6 dB width, highest sidelobe, and scalloping loss — the three numbers that decide whether a window is good for parametric extraction.

Zero-padding and the interpolation illusion

Zero-padding the time signal before the FFT does not add information; it interpolates the existing transform. But interpolation is exactly what we want for peak estimation — it lets the DFT samples land closer to the true peak. A factor of 4 or 8 is usually plenty, because after that the parabolic interpolation trick (next chapter) is more efficient than brute zero-padding.
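The interpolation claim is easy to check numerically; a minimal numpy sketch (the tone frequency and padding factors are illustrative), standing in for the browser demo:

```python
import numpy as np

def peak_freq(x, zero_pad_factor=1):
    """Arg-max frequency (rad/sample) of the zero-padded, Hann-windowed FFT."""
    N = len(x)
    M = N * zero_pad_factor
    X = np.fft.rfft(x * np.hanning(N), n=M)
    return 2 * np.pi * int(np.argmax(np.abs(X))) / M

N = 256
w0 = 2 * np.pi * 40.37 / N          # true frequency, deliberately between bins
x = np.cos(w0 * np.arange(N))

# no information is added, but the arg-max bin lands ever closer to the truth
errs = {zp: abs(peak_freq(x, zp) - w0) for zp in (1, 4, 8)}
```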

The important lesson: a "good window for parametric extraction" is one whose mainlobe is close to a parabola in dB near the peak. Hann, Hamming, and Blackman all qualify; Rectangular does not.

03 Quadratic Peak Interpolation

You have a DFT frame. A sinusoid is in there somewhere near bin $k^\star$, the bin of the highest magnitude. But the true frequency is almost certainly between bins — you need a sub-bin estimate. The standard industrial trick, taught in SASP and used in every serious sinusoidal analyser, is a three-point parabolic fit in dB around the peak.

The formula

Let $a = 20\log_{10}|X[k^\star{-}1]|$, $b = 20\log_{10}|X[k^\star]|$, and $c = 20\log_{10}|X[k^\star{+}1]|$. Fit a parabola through those three dB values and find its maximum. The offset from the centre bin is

$$\delta \;=\; \frac{1}{2}\,\frac{a - c}{a - 2b + c}\;\in\;[-\tfrac12,\tfrac12]$$

and the interpolated peak value in dB is

$$\hat b \;=\; b \;-\; \tfrac{1}{4}(a-c)\,\delta.$$

The true frequency estimate is $\hat\omega = \omega_{k^\star} + \delta \cdot (2\pi/N)$ radians/sample, and the true amplitude estimate is $10^{\hat b / 20}$ after correcting for the window gain.
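The whole recipe fits in a few lines of numpy; a sketch with an illustrative Hann-windowed test tone and 4x zero-padding:

```python
import numpy as np

def qint_peak(mag_db, k_star):
    """Three-point parabolic fit in dB around bin k_star.
    Returns (delta, b_hat): sub-bin offset in [-1/2, 1/2] and peak height in dB."""
    a, b, c = mag_db[k_star - 1], mag_db[k_star], mag_db[k_star + 1]
    delta = 0.5 * (a - c) / (a - 2 * b + c)
    b_hat = b - 0.25 * (a - c) * delta
    return delta, b_hat

# test tone deliberately between bins
N, pad = 1024, 4
w0 = 2 * np.pi * 100.3 / N
x = np.cos(w0 * np.arange(N)) * np.hanning(N)
X_db = 20 * np.log10(np.abs(np.fft.rfft(x, N * pad)) + 1e-12)
k_star = int(np.argmax(X_db))
delta, b_hat = qint_peak(X_db, k_star)
w_hat = 2 * np.pi * (k_star + delta) / (N * pad)   # sub-bin frequency estimate
```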

Abe and Smith's bias

The estimator above is biased for two reasons, both discussed in detail in SASP. First, the mainlobe on a dB axis is only approximately parabolic; the residual third-derivative shape introduces a small bias that depends on the offset. Second, the negative-frequency image of the same sinusoid leaks a little energy into the positive-frequency peak — more for short windows, less for long ones — which skews the three-point estimate slightly towards DC.

For a Hann window of length $L$ at a tone of digital frequency $\omega_0$, the bias is roughly

$$\Delta\omega \;\approx\; -\frac{C}{L^{\,2}\,\omega_0}$$

for a small constant $C$ that depends on the window. In practice the bias is below $10^{-3}$ bins for $L \ge 1024$ and any audio frequency. Parabolic interpolation is overwhelmingly the right tool.

Interactive: click any peak to see the estimate

···
Blue stems: DFT magnitude in dB. Orange curve: parabola fitted through three highest bins. Dashed vertical: true frequency. Dotted vertical: raw-bin estimate. Solid orange: parabolic estimate.
Move the frequency slider slowly and watch the raw-bin estimate jump by a whole bin at a time while the parabolic estimate tracks continuously. This is why every sinusoidal analyser from McAulay–Quatieri onward uses quadratic interpolation rather than arg-max alone.

Phase interpolation

Amplitude and frequency are not the only parameters you can read from a mainlobe; the phase at the true peak is recoverable too. With the same three-bin fit, the interpolated phase is obtained by evaluating the linear fit of the two complex bin values at the interpolated frequency — effectively

$$\hat\phi \;=\; \arg X[k^\star] \;+\; \delta\cdot\bigl(\arg X[k^\star+1] - \arg X[k^\star-1]\bigr)/2,$$

with the usual care about phase unwrapping. This is the third parameter $(A,\,\omega,\,\phi)$ that sinusoidal modelling needs per peak per frame.
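A numerical check of this formula, under the assumption of a symmetric (Hann) window and a tone safely inside one mainlobe; interpolating to the peak cancels the window's linear phase, so the estimate should land on the tone's phase at n = 0:

```python
import numpy as np

# illustrative demo values; any tone between bins works
N, pad = 1024, 4
phi0 = 1.234
w0 = 2 * np.pi * 100.3 / N
x = np.cos(w0 * np.arange(N) + phi0) * np.hanning(N)
X = np.fft.rfft(x, N * pad)
k_star = int(np.argmax(np.abs(X)))

# sub-bin offset from the three-point parabolic fit above
a = 20 * np.log10(abs(X[k_star - 1]))
b = 20 * np.log10(abs(X[k_star]))
c = 20 * np.log10(abs(X[k_star + 1]))
delta = 0.5 * (a - c) / (a - 2 * b + c)

# interpolated phase: wrapped phase slope between the two neighbouring bins
dphi = np.angle(X[k_star + 1] / X[k_star - 1])   # automatically wrapped
phi_hat = np.angle(X[k_star]) + delta * dphi / 2.0

# for a symmetric window, evaluating at the true peak cancels the window's
# linear phase, so phi_hat should recover phi0 (up to tiny interference terms)
err = np.angle(np.exp(1j * (phi_hat - phi0)))
```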

04 Sinusoidal Modelling — McAulay & Quatieri

In 1986 Robert J. McAulay and Thomas F. Quatieri, working at MIT Lincoln Laboratory on a military speech coder, published the paper that started a branch of audio DSP. Speech Analysis/Synthesis Based on a Sinusoidal Representation (IEEE Trans. ASSP 34(4), 744–754) proposes that voiced speech can be well approximated by a sum of a modest number of sinusoids whose amplitudes and phases evolve slowly from frame to frame:

$$\hat s(n) \;=\; \sum_{p=1}^{P(n)} A_p(n)\,\cos\!\bigl(\phi_p(n)\bigr).$$

The parameters $\{A_p(n), \phi_p(n)\}$ evolve smoothly in time along tracks. A track is born at some frame, lives for a while, and dies when it can no longer be matched. The analysis stage turns a signal into a list of tracks; the synthesis stage turns a list of tracks back into a signal.

The analysis pipeline

  1. Take a windowed DFT at frame $m$ (hop $H$).
  2. Pick the local maxima of the log-magnitude spectrum.
  3. For each peak, use quadratic interpolation to estimate $(A_p, \omega_p, \phi_p)$.
  4. Match peaks in frame $m+1$ to tracks alive at frame $m$.
  5. Unmatched tracks die; unmatched peaks give birth to new tracks.

The matching rule

McAulay and Quatieri match greedily by frequency proximity, with a threshold $\Delta$ in Hz. Given the frequencies $\{\omega_i^{(m)}\}$ at frame $m$ and $\{\omega_j^{(m+1)}\}$ at frame $m+1$, a track $i$ is continued by the frame-$m{+}1$ peak $j$ that minimises $|\omega_i - \omega_j|$ subject to $|\omega_i - \omega_j| < \Delta$. Peaks left without a partner give birth to new tracks; tracks with no peak within $\Delta$ are killed.

$$\text{match}(i) \;=\; \arg\min_{j} |\omega_i^{(m)} - \omega_j^{(m+1)}|, \qquad |\omega_i^{(m)} - \omega_j^{(m+1)}| < \Delta$$

The threshold $\Delta$ is usually a few Hz per ms of hop; the exact number matters less than the fact that one exists. The original MQ paper uses roughly $\Delta \approx 0.5$ semitones, i.e.\ about 3% of the frequency value.
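A simplified sketch of the greedy rule (the original paper adds a refinement step for contested peaks, omitted here; the frequencies are illustrative):

```python
import numpy as np

def match_tracks(prev_freqs, new_freqs, delta):
    """Greedy MQ-style frame-to-frame matching by frequency proximity.
    Returns (matches, deaths, births); matches is a list of (i, j) pairs."""
    matches, used = [], set()
    # visit previous tracks in frequency order; each takes its nearest free peak
    for i in np.argsort(prev_freqs):
        best_j, best_d = None, delta
        for j, f in enumerate(new_freqs):
            d = abs(prev_freqs[i] - f)
            if j not in used and d < best_d:
                best_j, best_d = j, d
        if best_j is None:
            continue                      # this track finds no peak: it will die
        matches.append((int(i), best_j))
        used.add(best_j)
    deaths = [i for i in range(len(prev_freqs)) if i not in {m[0] for m in matches}]
    births = [j for j in range(len(new_freqs)) if j not in used]
    return matches, deaths, births

prev_f = [220.0, 440.0, 662.0]   # tracks alive at frame m (Hz)
new_f = [221.5, 438.0, 880.0]    # peaks found at frame m+1
matches, deaths, births = match_tracks(prev_f, new_f, delta=10.0)
```

Here the 662 Hz track dies (no peak within 10 Hz) and the 880 Hz peak is born as a new track.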

The synthesis stage — phase-continuous oscillator interpolation

Between adjacent frames a track has a known frequency, amplitude, and phase at each end. The synthesiser must produce samples that match those two end-point conditions and keep the phase continuous, i.e.\ the derivative of phase should be smooth across the frame boundary. McAulay and Quatieri chose a cubic phase polynomial: amplitude is linearly interpolated between frames, phase is a cubic whose value and derivative at the two ends match $\phi_p$ and $\omega_p$:

$$\tilde\phi(t) \;=\; \phi_m \;+\; \omega_m\,t \;+\; \alpha\, t^2 \;+\; \beta\, t^3$$

with $(\alpha,\beta)$ chosen uniquely so that $\tilde\phi(H) \equiv \phi_{m+1} \pmod{2\pi}$ and $\tilde\phi'(H) = \omega_{m+1}$. The phase-unwrap step that resolves the integer ambiguity in $\phi_{m+1}$ is the MQ analogue of the phase-vocoder princarg update; SASP gives the closed form.
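One common closed form, following the MQ paper's maximally-smooth choice of the $2\pi$ multiple $M$ (a sketch; the variable names are ours):

```python
import numpy as np

def cubic_phase_coeffs(phi0, w0, phi1, w1, H):
    """Cubic phase interpolation across one hop of H samples (MQ-style).
    Picks the 2*pi multiple M giving the smoothest phase, then solves
    phi(t) = phi0 + w0*t + alpha*t^2 + beta*t^3 for the end conditions."""
    M = np.round(((phi0 + w0 * H - phi1) + (w1 - w0) * H / 2) / (2 * np.pi))
    phi1u = phi1 + 2 * np.pi * M          # unwrapped target phase
    D = phi1u - phi0 - w0 * H
    alpha = 3 * D / H**2 - (w1 - w0) / H
    beta = -2 * D / H**3 + (w1 - w0) / H**2
    return alpha, beta, phi1u

# illustrative end-point data for one track across one hop
H = 256
phi0, w0 = 0.3, 0.02      # phase (rad) and frequency (rad/sample) at frame m
phi1, w1 = 1.1, 0.025     # ... and at frame m+1
alpha, beta, phi1u = cubic_phase_coeffs(phi0, w0, phi1, w1, H)

# check both end conditions: value (mod 2*pi) and derivative at t = H
phi_end = phi0 + w0 * H + alpha * H**2 + beta * H**3
dphi_end = w0 + 2 * alpha * H + 3 * beta * H**2
```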

Interactive: MQ track visualiser

The flagship demo. Pick a signal and watch the peak-picking and the track continuation. Each track is a coloured polyline in the time-frequency plane; the line breaks where the track dies. Toggle the match threshold and see tracks merge, fragment, or disappear entirely. Switch to the audio panel and resynthesise from the tracks — with cubic-phase interpolation the resynthesis is indistinguishable from the original for stationary voiced signals; with piecewise-linear phase you hear characteristic artefacts.

···
Top: spectrogram. Bottom: MQ tracks in the time-frequency plane — each colour is one track. Change Δ to see birth/death behaviour shift.
Birth/death in MQ tracking is the visual echo of the hero canvas at the top of this page. Each bright dot up there is a peak in one frame; the trail it leaves is a track.

05 Spectrum Modelling with Residuals — SMS

Pure sinusoidal modelling works beautifully for stationary pitched sounds, but it has nothing to say about breath, bow noise, hammer thuds, cymbal hash, or the hiss of a held $/s/$. In 1990 Xavier Serra — then Smith's PhD student at CCRMA — published Spectral Modeling Synthesis: A Sound Analysis/Synthesis System Based on a Deterministic plus Stochastic Decomposition, extending MQ by explicitly separating signals into a sinusoidal deterministic part and a noise-like stochastic part.

$$s(n) \;=\; \underbrace{\sum_{p=1}^{P} A_p(n)\,\cos\phi_p(n)}_{\text{deterministic}} \;+\; \underbrace{e(n)}_{\text{residual}}.$$

How SMS estimates the residual

The deterministic part is exactly McAulay–Quatieri: track-based sinusoidal resynthesis. The residual $e(n) = s(n) - \hat s(n)$ in the time domain is then analysed separately. Serra observes that for musical signals $e(n)$ is very well modelled as filtered noise: a coloured random process whose short-time spectral envelope can be sampled at the same frame rate as the sinusoids. Synthesis becomes deterministic-cosine-bank plus filtered-white-noise convolution.

The residual envelope can be estimated by smoothing $|E(m,k)|$ along $k$ with a moving average, or equivalently by fitting a low-order cepstral or LPC envelope to the residual's log-magnitude. SMS uses line-segment approximations; more recent systems use LPC or the true envelope from the next chapter.
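The moving-average variant is nearly a one-liner; a sketch on a synthetic stand-in residual (the colouring curve is hypothetical):

```python
import numpy as np

def residual_envelope_db(E_frame, width=31):
    """Smooth one residual frame's dB magnitude along frequency with a
    moving average (the simplest of the envelope estimates mentioned above)."""
    mag_db = 20 * np.log10(np.abs(E_frame) + 1e-12)
    return np.convolve(mag_db, np.ones(width) / width, mode="same")

# stand-in residual: complex white noise coloured by a hypothetical broad dip
rng = np.random.default_rng(0)
k = np.arange(513)
colour_db = -20.0 * np.exp(-(((k - 200.0) / 80.0) ** 2))
E = 10 ** (colour_db / 20) * (rng.standard_normal(513) + 1j * rng.standard_normal(513))
env = residual_envelope_db(E)    # recovers the broad dip, not the per-bin hash
```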

Interactive: deterministic vs stochastic

···
Top: original spectrogram. Middle: deterministic-only spectrogram (bank-of-sinusoids). Bottom: residual spectrogram.
The ratio of sinusoidal energy to residual energy is a powerful diagnostic for musical timbre. A pure flute sits near 100% deterministic; a cymbal sits near 0%; a voiced vowel with breath sits at roughly 70–80%. SMS was one of the first analysis systems that could measure this ratio continuously in time, and it influenced every later CASA system.

06 Spectral Envelope Estimation

A harmonic sound has a fine structure (the comb of harmonics at multiples of $f_0$) and an envelope (the smooth curve that the harmonic tips sit on). The envelope encodes the timbre — the vocal tract filter, the body of an instrument, the resonance of a room. Three classical methods for estimating it are covered in SASP.

Cepstral liftering

The cepstrum is the IDFT of the log-magnitude spectrum. Fine structure has rapid variation across frequency — high-quefrency cepstrum — while the envelope has slow variation — low-quefrency cepstrum. "Liftering" (filtering in cepstrum domain) keeps only the lowest $Q$ quefrency bins and discards the rest.

$$\text{envelope}(k) \;=\; \exp\!\bigl(\mathrm{DFT}\bigl[c_n \cdot \mathbb{1}_{|n| \le Q}\bigr]\bigr)$$

The cepstral envelope is very cheap, but it tends to under-shoot formant peaks because harmonic peaks and valleys average out equally.
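A sketch of the lifter on a toy frame whose fine structure is a fast ripple standing in for the harmonic comb (all parameters illustrative):

```python
import numpy as np

def cepstral_envelope_db(mag_db, Q):
    """Lifter: keep only quefrencies |n| <= Q of the cepstrum (dB in, dB out)."""
    c = np.fft.irfft(mag_db)        # cepstrum of the dB log-magnitude
    c[Q + 1 : len(c) - Q] = 0.0     # zero all quefrencies above Q
    return np.fft.rfft(c).real

# toy frame: smooth dB envelope plus a fast ripple as fine structure
k = np.arange(1025)
true_env = -0.0002 * (k - 300.0) ** 2                  # smooth hump, peak at bin 300
spec = true_env + 5.0 * np.cos(2 * np.pi * k / 32.0)   # ripple, period 32 bins
env = cepstral_envelope_db(spec, Q=40)   # ripple quefrency is 64 > Q: removed
```

With $Q$ below the ripple quefrency the lifter strips the fine structure and returns the smooth component almost exactly.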

True envelope (Röbel & Rodet)

Iteratively take the cepstral envelope, then replace any spectral value below the envelope by the envelope itself; repeat. This "cepstral discriminator" converges to the envelope that touches the harmonic peaks from above — the true envelope. Röbel and Rodet (2005) proved convergence; the method is now a standard feature in tools like SuperVP, World, and phase-vocoder pitch shifters that need to preserve formants.
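A sketch of the iteration (the cepstral-envelope helper is defined inline to keep it self-contained; the iteration count and test comb are illustrative, and real implementations stop on a convergence test instead):

```python
import numpy as np

def cepstral_envelope_db(mag_db, Q):
    c = np.fft.irfft(mag_db)
    c[Q + 1 : len(c) - Q] = 0.0
    return np.fft.rfft(c).real

def true_envelope_db(mag_db, Q, iters=300):
    """Roebel-Rodet iteration: lift, push the spectrum up to the envelope, repeat."""
    target = mag_db.astype(float).copy()
    env = cepstral_envelope_db(target, Q)
    for _ in range(iters):
        target = np.maximum(target, env)   # replace values below the envelope
        env = cepstral_envelope_db(target, Q)
    return env

# comb of harmonic peaks every 32 bins over a -40 dB inter-harmonic floor
k = np.arange(1025)
peaks_db = -3e-5 * (k - 300.0) ** 2    # the curve the harmonic tips sit on
spec = np.full(1025, -40.0)
spec[::32] = peaks_db[::32]
plain = cepstral_envelope_db(spec, Q=40)   # undershoots badly
env = true_envelope_db(spec, Q=40)         # climbs up to touch the peaks
```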

Linear prediction (LPC)

A different route. Assume the signal is the output of an all-pole filter driven by white noise (or a pulse train). Find the filter coefficients $\{a_1,\dots,a_p\}$ that minimise the one-step prediction error

$$e(n) \;=\; x(n) + \sum_{i=1}^{p} a_i\, x(n-i).$$

The squared filter response $|1/A(e^{j\omega})|^2$ is then the LPC spectrum, and its peaks are the formants. LPC is historically the workhorse of speech coding (Atal & Schroeder at Bell Labs, Itakura & Saito at NTT, Markel & Gray), and it is still used inside most modern vocoders. The Levinson–Durbin recursion that solves for the $a_i$ in $\mathcal{O}(p^2)$ appears in chapter 10.

Interactive: LPC order slider on a vowel

···
Blue: DFT magnitude (dB). Orange: LPC envelope. Green: cepstral envelope. Red ticks: formant peaks detected on the LPC envelope.

07 Pitch Estimation

Pitch is one of the oldest problems in audio DSP. SASP covers the three main families: autocorrelation in the time domain, cepstrum in the quefrency domain, and the modern difference-function family (YIN). Each family decides which samples of the signal look most like copies of each other at lag $T_0$.

Autocorrelation

$$r(\tau) \;=\; \sum_{n=0}^{L-1-\tau} x(n)\,x(n+\tau)$$

A periodic signal at period $T_0$ has a peak at $\tau = T_0$. Find the highest peak away from $\tau = 0$. Cheap and effective for clean pitched signals, but prone to octave errors when the fundamental is weak or a single partial dominates.

Cepstrum

The cepstrum of a voiced sound has a spike at quefrency $T_0$ — the period of the harmonic comb in the log-magnitude spectrum. Noll proposed it in 1967 and it remained the standard speech pitch tracker into the 1990s. Picks the right octave more reliably than autocorrelation.

YIN — de Cheveigné & Kawahara (2002)

The pitch tracker inside most modern audio software. YIN defines the difference function

$$d(\tau) \;=\; \sum_{n=0}^{L-1-\tau} \bigl(x(n) - x(n+\tau)\bigr)^{2}$$

which is zero at $\tau = T_0$ for a perfect periodic signal. For real signals it has a deep local minimum there. The second YIN insight is the cumulative mean normalised difference function

$$d'(\tau) \;=\; \begin{cases} 1 & \tau = 0\\[3pt] \dfrac{d(\tau)}{\frac{1}{\tau}\sum_{j=1}^{\tau} d(j)} & \tau > 0\end{cases}$$

which starts at 1 and dips well below 1 at true periods. Pick the smallest $\tau$ for which $d'(\tau)$ falls below an absolute threshold (typically $0.1$), and parabolically interpolate around it. YIN's robustness against noise and octave errors made it a de-facto standard; directly or through its probabilistic descendant pYIN (as in librosa's pyin), it underlies pitch trackers across modern audio software.
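A minimal YIN sketch along these lines (simplified: the integration window shrinks with τ here, where the paper keeps it fixed; the signal and thresholds are illustrative):

```python
import numpy as np

def yin_period(x, tau_max, threshold=0.1):
    """Minimal YIN: difference function, cumulative-mean normalisation,
    absolute threshold, parabolic refinement. Returns the period in samples."""
    L = len(x)
    d = np.array([np.sum((x[: L - tau] - x[tau:]) ** 2) for tau in range(tau_max)])
    dp = np.ones(tau_max)
    dp[1:] = d[1:] * np.arange(1, tau_max) / np.maximum(np.cumsum(d[1:]), 1e-12)
    # first dip below the threshold, else the global minimum
    below = np.where(dp[2:] < threshold)[0]
    tau = int(below[0] + 2) if below.size else int(np.argmin(dp[2:]) + 2)
    # walk down to the local minimum, then refine parabolically
    while tau + 1 < tau_max and dp[tau + 1] < dp[tau]:
        tau += 1
    a, b, c = dp[tau - 1], dp[tau], dp[tau + 1]
    return tau + 0.5 * (a - c) / (a - 2 * b + c)

fs, f0 = 8000.0, 220.0
n = np.arange(2048)
rng = np.random.default_rng(1)
x = np.sin(2 * np.pi * f0 / fs * n) + 0.05 * rng.standard_normal(2048)
T = yin_period(x, tau_max=400)     # fs / T is the f0 estimate in Hz
```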

Interactive: YIN on a pitched signal with noise

···
Top: raw difference function d(τ). Bottom: cumulative normalised d'(τ) with threshold line. Orange vertical: detected period.

08 Time–Frequency Reassignment

A spectrogram cell is wide in time and wide in frequency — the Heisenberg–Gabor box. The energy that falls inside that cell is not uniformly distributed across the cell; for a single sinusoid, all of it is concentrated at a single point. Reassignment is a standard post-processing trick that moves each spectrogram pixel to the centre of gravity of its energy, producing a far sharper picture without changing the underlying FFTs.

Kodera, Gendrin, de Villedary (1976)

Kodera and colleagues proposed the principle in a 1976 paper on geophysical signal analysis: move each spectrogram value from its grid cell $(t_m, \omega_k)$ to the centre of gravity indicated by the derivatives of the STFT phase $\Phi = \Im\{\log X\}$:

$$(\hat t, \hat\omega) \;=\; \bigl(-\,\partial_\omega \Phi,\;\; \omega_k + \partial_t \Phi\bigr),$$

in the convention where the STFT's phase reference stays at the time origin.

Auger and Flandrin (1995)

François Auger and Patrick Flandrin gave the closed-form discrete version, which is cheap to compute. Given the STFT with window $w$, you also compute two "helper" STFTs: one with window $t\,w(t)$ (time ramp), and one with window $w'(t)$ (window derivative). The reassignment coordinates are

$$\hat t \;=\; t_m + \Re\!\left\{\frac{X_{tw}(t_m,\omega_k)}{X_{w}(t_m,\omega_k)}\right\},\qquad \hat\omega \;=\; \omega_k - \Im\!\left\{\frac{X_{w'}(t_m,\omega_k)}{X_{w}(t_m,\omega_k)}\right\}.$$

For a single chirp, reassignment turns the smeared spectrogram into an almost pixel-perfect line, and for a bell it turns the smeared striae into tight clusters at the partials. The cost is a loss of interpretability for mixed regions — where a sinusoid and an impulse overlap, reassigned pixels can go in both directions.
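The frequency half of these formulas can be checked numerically on one frame; a sketch using the analytic Hann-window derivative (the time coordinate works the same way with the ramped window $t\,w(t)$):

```python
import numpy as np

def reassigned_freq(x_frame, w, dw, k):
    """Auger-Flandrin frequency reassignment for one frame and bin k (rad/sample)."""
    N = len(x_frame)
    Xw = np.fft.fft(x_frame * w)     # ordinary windowed STFT frame
    Xdw = np.fft.fft(x_frame * dw)   # helper frame with the derivative window
    return 2 * np.pi * k / N - np.imag(Xdw[k] / Xw[k])

N = 1024
n = np.arange(N)
w = 0.5 * (1 - np.cos(2 * np.pi * n / (N - 1)))            # Hann window
dw = (np.pi / (N - 1)) * np.sin(2 * np.pi * n / (N - 1))   # its per-sample derivative

w0 = 2 * np.pi * 200.37 / N        # tone deliberately between bins
x = np.exp(1j * w0 * n)            # complex tone: no negative-frequency image
k_star = int(np.argmax(np.abs(np.fft.fft(x * w))))
w_hat = reassigned_freq(x, w, dw, k_star)
```

The reassigned frequency lands essentially on the true $\omega_0$, far inside the bin-centre error that the plain spectrogram commits.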

Interactive: reassigned spectrogram on chirp + bell

Top: ordinary spectrogram. Bottom: reassigned spectrogram computed with the Auger-Flandrin formulas using three STFTs.

09 The Chirp-Z Transform

The DFT samples $H(z)$ on the unit circle at $N$ equally spaced angles. The chirp-z transform (Rabiner, Schafer & Rader, 1969) samples $H(z)$ on an arbitrary spiral starting at any point and stepping by any complex ratio. This lets you zoom into a narrow frequency band of an FIR filter, or evaluate the filter's response on a near-unit-circle contour to inspect pole damping — things the plain FFT cannot do without enormous zero-padding.

$$X_k \;=\; \sum_{n=0}^{N-1} x(n)\,A^{-n} W^{nk},\qquad k=0,\ldots,M-1,$$

where $A = A_0 e^{j\theta_0}$ is the start point and $W = W_0 e^{j\phi_0}$ is the step. For $A_0 = 1$, $\theta_0 = 0$, $W_0 = 1$, $\phi_0 = -2\pi/N$ you recover the DFT.

Why it matters

Bluestein (1970) showed that $X_k$ above can be computed by three FFTs of length $\ge N + M - 1$. That makes the chirp-z transform practical, and it is how modern software evaluates a single narrow band of a spectrum at arbitrary resolution — think of the high-resolution zooming on a DAW spectrum analyser around a harmonic of a kick drum.
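A sketch of Bluestein's reduction, with our own variable names, checked against the plain DFT in the $A = 1$, $W = e^{-j2\pi/N}$ case:

```python
import numpy as np

def czt(x, M, W, A):
    """Chirp-z transform via Bluestein's three-FFT convolution trick.
    Evaluates X_k = sum_n x[n] A^{-n} W^{n k} for k = 0..M-1."""
    N = len(x)
    n, k = np.arange(N), np.arange(M)
    y = x * A ** (-n) * W ** (n ** 2 / 2.0)      # chirp pre-multiply
    m = np.arange(-(N - 1), M, dtype=float)
    v = W ** (-(m ** 2) / 2.0)                    # chirp "filter"
    L = 1
    while L < N + M - 1:                          # FFT length avoiding wraparound
        L *= 2
    g = np.fft.ifft(np.fft.fft(y, L) * np.fft.fft(v, L))[N - 1 : N - 1 + M]
    return g * W ** (k ** 2 / 2.0)                # chirp post-multiply

rng = np.random.default_rng(0)
x = rng.standard_normal(32) + 1j * rng.standard_normal(32)
X = czt(x, 32, np.exp(-2j * np.pi / 32), 1.0)     # should equal the DFT
```

Setting $A = e^{j\theta_0}$ and $W = e^{-j\Delta}$ with small $\Delta$ instead gives $M$ dense samples starting at angle $\theta_0$, which is the narrow-band zoom described above.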

Interactive: pick a spiral, see the zoomed spectrum

···
Left: the z-plane with unit circle and the chosen spiral contour. Right: full DFT of the signal in dB (blue) with the CZT-zoomed slice (orange) plotted on top. Move the span slider to the left to see the chirp-z's fine resolution over the narrow band.

10 Linear Prediction — Levinson and the Lossless Tube

Linear prediction is such a central tool in audio DSP that SASP devotes a full chapter to it separately from the envelope-estimation use in chapter 6. Three ideas interlock: the Yule–Walker normal equations, the $\mathcal{O}(p^2)$ Levinson–Durbin recursion, and the lossless acoustic-tube interpretation of the reflection coefficients.

The normal equations

Minimising $E = \sum e(n)^2$ by setting $\partial E/\partial a_i = 0$ gives the Toeplitz system

$$\begin{pmatrix} r(0) & r(1) & \cdots & r(p-1) \\ r(1) & r(0) & \cdots & r(p-2) \\ \vdots & & \ddots & \vdots \\ r(p-1) & r(p-2) & \cdots & r(0) \end{pmatrix} \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_p \end{pmatrix} = -\begin{pmatrix} r(1) \\ r(2) \\ \vdots \\ r(p) \end{pmatrix}.$$

Levinson–Durbin

Rather than invert the $p \times p$ Toeplitz matrix, we ascend through orders $1, 2, \ldots, p$. At order $i$ the reflection coefficient is

$$k_i \;=\; -\frac{r(i) + \sum_{j=1}^{i-1} a_j^{(i-1)}\, r(i-j)}{E_{i-1}},$$

the new predictor coefficients are $a_j^{(i)} = a_j^{(i-1)} + k_i\,a_{i-j}^{(i-1)}$, and the prediction error power updates as

$$E_i \;=\; E_{i-1}\,(1 - k_i^{\,2}).$$

Stability: if all $|k_i| < 1$ then $A(z) = 1 + \sum a_i z^{-i}$ has all zeros inside the unit circle, so $1/A(z)$ is stable. The recursion is Norman Levinson's (1947), rediscovered in the time-series context by J. Durbin (1960), and put to work in speech coding by B. S. Atal, F. Itakura, J. D. Markel, and others.
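The recursion translates almost line by line into code; a sketch checked against a direct solve of the same Toeplitz system (signal and order are illustrative):

```python
import numpy as np

def levinson_durbin(r, p):
    """Levinson-Durbin: autocorrelation r[0..p] -> (a[1..p], k[1..p], E_p)."""
    a = np.zeros(p + 1)
    a[0] = 1.0
    E = float(r[0])
    ks = []
    for i in range(1, p + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / E                      # reflection coefficient at order i
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        E *= (1.0 - k * k)                # prediction-error power update
        ks.append(k)
    return a[1:], np.array(ks), E

# autocorrelation of a noisy cosine, then order-5 LPC
rng = np.random.default_rng(3)
x = np.cos(0.3 * np.arange(512)) + 0.1 * rng.standard_normal(512)
r = np.array([np.dot(x[: 512 - t], x[t:]) for t in range(6)])
a, ks, E = levinson_durbin(r, 5)

# the same normal equations solved directly
T = np.array([[r[abs(i - j)] for j in range(5)] for i in range(5)])
a_direct = np.linalg.solve(T, -r[1:6])
```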

The lossless tube interpretation

The reflection coefficients $\{k_i\}$ have a beautiful physical meaning: they are literally the boundary reflection coefficients of a lossless acoustic tube made of $p+1$ uniform cylindrical sections. A positive $k_i$ means the tube widens, a negative $k_i$ means it narrows. This insight, due to Kelly and Lochbaum in the 1960s (and developed by Markel and Gray), is why LPC analysis and articulatory vocal-tract modelling converged: $\{k_i\}$ from speech analysis give you a tube, and the tube reproduces the speech. The cross-link to Physical Audio Signal Processing is direct.

Interactive: Levinson–Durbin

···
Left: reflection coefficients k_i bar chart (green: |k|<1 stable). Right: prediction-error power E_i as a function of order (log scale) — a "knee" indicates that more coefficients stop helping.

11 Phase-Vocoder Variants

The phase vocoder — Flanagan & Golden's 1966 Bell Labs idea to treat each STFT bin as an amplitude-modulated, frequency-modulated carrier — lives inside almost every audio time-stretcher and pitch-shifter. The mechanics of the phase-unwrap update belong in the STFT companion, which derives them in full; here we survey the variants SASP focuses on.

Flanagan & Golden (1966)

The original. Each analysis bin carries a magnitude and an instantaneous frequency; each synthesis bin reconstructs a sinusoid from those. In its plain form it suffers from "phasiness": sinusoids that straddle two bins (which is most of them) lose their strict phase coupling when each bin's phase is advanced independently.

Laroche & Dolson identity phase locking (1999)

Jean Laroche and Mark Dolson fixed phasiness with a small change. Detect the local magnitude peaks in every frame; for each peak, advance its phase by the usual phase-vocoder update; but for the neighbouring bins around it, copy the peak's phase increment so the relative phase across the mainlobe is preserved. The result sounds dramatically cleaner.

$$\tilde\phi_{k\pm 1}^{(m)} \;=\; \tilde\phi_{k}^{(m)} \,+\,\bigl(\angle X_{k\pm 1}^{(m)} - \angle X_{k}^{(m)}\bigr).$$
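A sketch of the locking step for one synthesis frame; `synth_phase_peaks` stands in for the phases that the ordinary phase-vocoder update assigns to the peak bins (a hypothetical input), and only the immediate neighbours of each peak are locked, as in the formula above:

```python
import numpy as np

def identity_phase_lock(X_frame, synth_phase_peaks):
    """Laroche-Dolson identity phase locking for one synthesis frame.
    Peaks get their phase-vocoder phases; each immediate neighbour copies the
    peak's phase rotation, preserving relative phase across the mainlobe."""
    mags = np.abs(X_frame)
    phases = np.angle(X_frame)
    is_peak = np.zeros(len(mags), dtype=bool)
    is_peak[1:-1] = (mags[1:-1] > mags[:-2]) & (mags[1:-1] > mags[2:])
    out = phases.copy()
    for k in np.flatnonzero(is_peak):
        rot = synth_phase_peaks[k] - phases[k]   # this peak's phase increment
        out[k] = synth_phase_peaks[k]
        out[k - 1] = phases[k - 1] + rot         # neighbours inherit the rotation
        out[k + 1] = phases[k + 1] + rot
    return out

# toy frame with a single peak at bin 5 and small random phases
rng = np.random.default_rng(2)
phases = rng.uniform(-0.5, 0.5, 10)
mags = np.array([0.1, 0.1, 0.2, 0.5, 1.0, 2.0, 1.2, 0.4, 0.1, 0.1])
X = mags * np.exp(1j * phases)
synth = np.zeros(10)
synth[5] = 2.0                                   # hypothetical PV phase for bin 5
out = identity_phase_lock(X, synth)
```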

Griffin–Lim magnitude-only reconstruction (1984)

If you have only a magnitude spectrogram — say from a neural model — the Griffin–Lim alternating projection is the classical way to invent a plausible phase. It is covered in detail in the STFT for Sound companion. Modern neural vocoders (WaveNet, HiFi-GAN, WaveGlow) beat Griffin–Lim on perceived quality, but it remains the dependency-free baseline.

Sinusoidal-model phase vocoder

SASP's distinctive contribution is the observation that the plain phase-vocoder phase-update equations are exactly what fall out of sinusoidal modelling when you identify each bin with a single track. The MQ synthesiser is a phase vocoder, one oscillator per track, with the bonus that birth/death resolve explicitly. A sinusoidal-model time-stretch stretches tracks, not bins, and does not suffer from phasiness at all. Modern tools like SPEAR, Loris, and SMSTools implement this approach.

A working pitch-shifter you can inspect live: MusicalTimeStretchingPitchShifting. The companion repo PhaseVocoder carries a minimal reference implementation.

12 Applications

The methods in chapters 3–11 are the building blocks for four broad application families in audio. Each is an active area of research and each runs on the same few primitives: pick peaks, track them, estimate envelopes, separate, resynthesise.

Additive resynthesis with track editing

The earliest application of MQ/SMS analysis. Edit the tracks — move them in pitch, stretch them in time, remove some — and resynthesise. SPEAR, Loris, and SMSTools all implement this. The interactive SMS demo in chapter 5 is a miniature version: the "deterministic" button is exactly additive resynthesis from the retained peaks.

Pitch-shifting that preserves timbre

Plain resampling shifts pitch and formants together, producing the "chipmunk" effect. A sinusoidal-model pitch shift scales the frequencies but keeps the amplitudes driven by the spectral envelope of the original — so when the pitch goes up, the formants stay put. This is, in outline, how Melodyne, Auto-Tune, Ableton's Complex Pro, and Logic's Flex Pitch work under the hood. The envelope is usually a cepstral or true envelope, sampled at the new partial frequencies.

Sinusoidal-model denoising

If a signal is known to be mostly a small number of sinusoids — a solo violin, a sustained vowel — you can denoise aggressively by resynthesising only the tracks that live long enough and are loud enough. Noise shows up as short, faint peaks that never link into tracks; they are discarded by the track-length and amplitude thresholds. This approach is the ancestor of modern non-negative matrix factorisation denoisers.

Adaptive spectral whitening

Invert the estimated spectral envelope (LPC, cepstral, or true) to get a whitening filter, apply it, and you have a flattened residual that is much easier to analyse for onsets, pitch, or chord detection. Adaptive whitening is baked into almost every modern music-information-retrieval pipeline.

Differentiable spectral modelling (DDSP)

Engel, Hantrakul, Gu, and Roberts (2020) showed that sinusoidal plus filtered-noise decomposition can be made differentiable and trained end-to-end with a neural controller, producing the Differentiable Digital Signal Processing framework. DDSP is the direct neural descendant of SMS: sinusoids plus stochastic residual, with the parameters emitted by a neural net rather than hand-tuned. See the DDSP_Differentiable_DSP companion for an interactive treatment.

13 Legacy and Continuing Impact

Smith's SASP is the Rosetta Stone between classical spectrum analysis and today's neural-audio world. Its vocabulary — tracks, deterministic plus stochastic, reassignment, cepstral envelope, reflection coefficients — is the vocabulary of every modern spectral audio system, even the ones whose parameters are learned instead of estimated.

Software descendants

SPEAR, Loris, and SMSTools carry the MQ/SMS analysis-resynthesis chain forward as interactive tools; SuperVP and World build the true-envelope method of chapter 6 into production pitch and formant shifters.

Coding standards

MPEG-4 HILN (Harmonic and Individual Lines plus Noise, Purnhagen & Meine 2000) codes low-bitrate audio with an explicit sinusoids-plus-noise model — SMS in a standards document. Even where MDCT has won the codec wars (AAC, Opus), residual noise shaping borrows the SMS stochastic-part idea.

Neural spectral vocoders

WaveNet (van den Oord 2016), Parallel WaveNet (2017), WaveRNN (2018), MelGAN (2019), HiFi-GAN (2020), BigVGAN (2022) all map log-mel spectrograms to waveforms with neural networks. None of them explicitly uses MQ tracks or LPC, but the input representation they consume — log-mel spectrogram — is exactly what SASP's chapter on the STFT gives you. Their output quality rises in step with the fidelity of their handling of the sinusoidal part of the spectrum, which is still the hardest thing for a neural net to reproduce cleanly.

Differentiable spectral modelling

DDSP (Engel et al. 2020) closed the circle: sinusoidal-plus-noise synthesis implemented with differentiable operations, parameters emitted by a neural controller, trained end-to-end. The result is orders of magnitude more parameter-efficient than raw neural vocoding and inherits the interpretability of SMS. As of 2025, DDSP-style hybrid models sit behind several commercial synths; the reference implementation grew out of Google's Magenta project.

Related guides in this series

STFT_for_Sound Wavelets_for_Sound Constant_Q_Transform_for_Sound MFCC_for_Sound MDCT_for_Sound Sound_Transforms_History DDSP_Differentiable_DSP PhaseVocoder MusicalTimeStretchingPitchShifting Synth_Additive Synth_Spectralist

And the sibling Companion repos on the four JOS books:

Companion_JOS_Mathematics_of_the_DFT Companion_JOS_Introduction_to_Digital_Filters Companion_JOS_Physical_Audio_Signal_Processing Companion_JOS_Audio_Signal_Processing_in_Faust

Key references

  1. Smith, J. O. (2011). Spectral Audio Signal Processing. W3K Publishing. Online at ccrma.stanford.edu/~jos/sasp/.
  2. McAulay, R. J. & Quatieri, T. F. (1986). Speech Analysis/Synthesis Based on a Sinusoidal Representation. IEEE Trans. ASSP, 34(4), 744–754.
  3. Serra, X. & Smith, J. O. (1990). Spectral Modeling Synthesis: A Sound Analysis/Synthesis System Based on a Deterministic plus Stochastic Decomposition. Computer Music Journal, 14(4), 12–24.
  4. Abe, M. & Smith, J. O. (2004). Design Criteria for the Quadratically Interpolated FFT Method (I): Bias due to Interpolation. CCRMA Technical Report STAN-M-117.
  5. de Cheveigné, A. & Kawahara, H. (2002). YIN, a Fundamental Frequency Estimator for Speech and Music. J. Acoust. Soc. Am., 111(4), 1917–1930.
  6. Noll, A. M. (1967). Cepstrum Pitch Determination. J. Acoust. Soc. Am., 41(2), 293–309.
  7. Kodera, K., Gendrin, R. & de Villedary, C. (1976). A New Method for the Numerical Analysis of Nonstationary Signals. Phys. Earth Planet. Int., 12, 142–150.
  8. Auger, F. & Flandrin, P. (1995). Improving the Readability of Time-Frequency and Time-Scale Representations by the Reassignment Method. IEEE Trans. Signal Processing, 43(5), 1068–1089.
  9. Rabiner, L. R., Schafer, R. W. & Rader, C. M. (1969). The Chirp z-Transform Algorithm. IEEE Trans. Audio Electroacoust., 17(2), 86–92.
  10. Bluestein, L. I. (1970). A Linear Filtering Approach to the Computation of the Discrete Fourier Transform. IEEE Trans. Audio Electroacoust., 18(4), 451–455.
  11. Atal, B. S. & Schroeder, M. R. (1979). Predictive Coding of Speech Signals and Subjective Error Criteria. IEEE Trans. ASSP, 27(3), 247–254.
  12. Itakura, F. & Saito, S. (1970). A Statistical Method for Estimation of Speech Spectral Density and Formant Frequencies. Electronics and Communications in Japan, 53-A, 36–43.
  13. Markel, J. D. & Gray, A. H. (1976). Linear Prediction of Speech. Springer-Verlag.
  14. Levinson, N. (1947). The Wiener RMS Error Criterion in Filter Design and Prediction. J. Math. Phys., 25, 261–278.
  15. Durbin, J. (1960). The Fitting of Time-Series Models. Rev. Int. Statist. Inst., 28, 233–244.
  16. Flanagan, J. L. & Golden, R. M. (1966). Phase Vocoder. Bell System Technical Journal, 45(9), 1493–1509.
  17. Laroche, J. & Dolson, M. (1999). Improved Phase Vocoder Time-Scale Modification of Audio. IEEE Trans. SAP, 7(3), 323–332.
  18. Griffin, D. W. & Lim, J. S. (1984). Signal Estimation from Modified Short-Time Fourier Transform. IEEE Trans. ASSP, 32(2), 236–243.
  19. Röbel, A. & Rodet, X. (2005). Efficient Spectral Envelope Estimation and its Application to Pitch Shifting and Envelope Preservation. Proc. DAFx, 30–35.
  20. Purnhagen, H. & Meine, N. (2000). HILN—The MPEG-4 Parametric Audio Coding Tools. Proc. ISCAS.
  21. Engel, J. et al. (2020). DDSP: Differentiable Digital Signal Processing. Proc. ICLR 2020.
  22. Fitz, K. & Haken, L. (2002). On the Use of Time–Frequency Reassignment in Additive Sound Modeling. J. Audio Eng. Soc., 50(11), 879–893.
  23. Kong, J. et al. (2020). HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis. NeurIPS 2020.
  24. van den Oord, A. et al. (2016). WaveNet: A Generative Model for Raw Audio. arXiv:1609.03499.