An interactive guide to Julius O. Smith III's book on parametric spectrum
modelling — sinusoidal tracking, spectral envelopes, pitch, reassignment,
the chirp-z transform, and linear prediction. Written to sit above
the plain STFT and below neural spectrum modelling.
Julius O. Smith III · McAulay & Quatieri · Serra & Smith · Auger & Flandrin · de Cheveigné & Kawahara
Julius O. Smith's third online book, Spectral Audio Signal Processing
(SASP), is not really about the DFT. The DFT is covered in his first book,
the STFT in his second. SASP is about everything that follows a spectrum:
how to read it, how to parameterise it, how to track its peaks across time, how
to model its envelope, how to re-sharpen it after a window has smeared it,
and how to synthesise sound from the numbers that come out the other end.
This companion walks through SASP chapter by chapter, with live visualisations
of every important construction. Each demo is computed in your browser with a
hand-written radix-2 FFT and the Web Audio API — no servers, no
dependencies except KaTeX for the mathematics.
What SASP adds beyond the earlier books
Smith's four online books form a tower. Mathematics of the DFT (MDFT)
builds the bare transform. Introduction to Digital Filters (Filters)
builds the LTI machinery. Physical Audio Signal Processing (Physical)
builds strings, tubes, rooms. Spectral Audio Signal Processing (SASP)
lives in between — it is the book that answers the question
"now that I have a spectrum, what do I do with it?".
Book | Lives at | Answers
MDFT | Signal → transform | What is a DFT?
Filters | Transform → LTI design | How do I shape sound?
SASP | Transform → parameters | What is in this spectrum?
Physical | Parameters → sound | How do strings and tubes sound?
What you should already know
If you have worked through the
STFT for Sound
companion, you are ready. This guide does not re-derive windowing, COLA, or
the phase-vocoder phase update — those sit in the STFT guide. Here we
stand above that layer: we assume you already have a spectrogram and we ask
what to do with it.
The hero canvas above is not decoration — it is the key image of this
book. Each bright dot is a sinusoid with a frequency (vertical axis), a time
(horizontal axis), and an amplitude (brightness). The trails behind the dots
are the tracks that the McAulay–Quatieri algorithm will draw
in chapter 4. Every major chapter of SASP is an answer to a question about
those dots.
Cross-links to siblings
This companion is one of a family of interactive audio-DSP guides. Links that
matter for what follows:
The STFT turns a signal into a grid of complex numbers. For parametric
modelling we care about peaks in the log-magnitude of each frame,
because peaks are where sinusoids live. This chapter is a refresher with a
specifically parametric eye; the mechanics of windowing and hop sizing are
covered in full in the
STFT_for_Sound
companion.
The single-sinusoid picture
When you compute the DFT of a windowed pure tone, you do not see a delta
function — you see the window's Fourier transform shifted to the
tone's frequency. Every window has a characteristic mainlobe width (which
controls the minimum resolvable frequency spacing), and a characteristic
sidelobe level (which controls how much a strong sinusoid can mask a weak one
nearby).
When the two copies of the window transform $W$ — one centred at $+\omega_0$,
the other at $-\omega_0$ — do not overlap appreciably, the positive-frequency
peak is a clean image of $W$. Parametric methods exploit that: they fit a
known mainlobe shape to the few bins around the peak and read off the
parameters $(A, \omega_0, \phi_0)$ directly.
Interactive: window library with parametric readouts
Left: window in time. Right: zero-padded FFT magnitude in dB. Readout reports main-lobe 6 dB width, highest sidelobe, and scalloping loss — the three numbers that decide whether a window is good for parametric extraction.
Zero-padding and the interpolation illusion
Zero-padding the time signal before the FFT does not add information; it
interpolates the existing transform. But interpolation is exactly what we
want for peak estimation — it lets the DFT samples land closer to the
true peak. A factor of 4 or 8 is usually plenty, because after that the
parabolic interpolation trick (next chapter) is more efficient than brute
zero-padding.
The important lesson: a "good window for parametric extraction" is one
whose mainlobe is close to a parabola in dB near the peak. Hann, Hamming,
and Blackman all qualify; Rectangular does not.
03 · Quadratic Peak Interpolation
You have a DFT frame. A sinusoid is in there somewhere near bin $k^\star$,
the bin of the highest magnitude. But the true frequency is almost certainly
between bins — you need a sub-bin estimate. The standard
industrial trick, taught in SASP and used in every serious sinusoidal
analyser, is a three-point parabolic fit in dB around the peak.
The formula
Let $a = 20\log_{10}|X[k^\star{-}1]|$, $b = 20\log_{10}|X[k^\star]|$, and
$c = 20\log_{10}|X[k^\star{+}1]|$. Fit a parabola through those three dB
values and find its maximum. The offset of the vertex from the centre bin is
$$\delta \;=\; \frac{1}{2}\,\frac{a-c}{a-2b+c} \;\in\; \bigl[-\tfrac12,\tfrac12\bigr],$$
and the interpolated peak height in dB is
$$\hat b \;=\; b \;-\; \tfrac{1}{4}(a-c)\,\delta.$$
The true frequency estimate is $\hat\omega = \omega_{k^\star} + \delta \cdot (2\pi/N)$
radians/sample, and the true amplitude estimate is $10^{\hat b / 20}$ after
correcting for the window gain.
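The whole recipe fits in a few lines. This is a minimal Python sketch — a direct $O(N^2)$ DFT stands in for the guide's radix-2 FFT, and the test frequency `f0` is a made-up value chosen to fall between bins; the sub-bin offset uses the standard parabola-vertex formula $\delta = \tfrac12(a-c)/(a-2b+c)$:

```python
import cmath, math

def qifft_peak(x, fs, zp=4):
    """Three-point parabolic ("QIFFT") peak-frequency estimate in dB.
    Direct O(N^2) DFT for clarity; an FFT would compute the same bins."""
    N = len(x) * zp
    xp = list(x) + [0.0] * (N - len(x))            # zero-pad: denser spectral grid
    mag = [abs(sum(xp[n] * cmath.exp(-2j * math.pi * k * n / N)
                   for n in range(N)))
           for k in range(N // 2)]
    k = max(range(1, N // 2 - 1), key=mag.__getitem__)   # raw-bin arg-max
    a, b, c = (20 * math.log10(mag[k + d]) for d in (-1, 0, 1))
    delta = 0.5 * (a - c) / (a - 2 * b + c)        # sub-bin offset, in [-1/2, 1/2]
    return (k + delta) * fs / N                    # Hz

fs, L, f0 = 8000.0, 128, 1000.7                    # f0 deliberately between bins
w = [0.5 - 0.5 * math.cos(2 * math.pi * n / L) for n in range(L)]   # Hann
x = [w[n] * math.cos(2 * math.pi * f0 * n / fs) for n in range(L)]
est = qifft_peak(x, fs)                            # within a fraction of a bin
```

With a 128-sample Hann window the raw bin grid is 15.6 Hz coarse (after 4× padding), yet the parabolic estimate lands well inside 1 Hz of the true 1000.7 Hz.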
Why dB, not linear, not squared?
A parabola is a quadratic; it fits any smooth local maximum to second
order in a Taylor expansion. The question is which vertical axis makes
the second-order approximation most accurate. Abe and Smith showed (2004)
that for the usual windows — Hann, Hamming, Blackman, Gaussian
— the mainlobe shape on a dB axis is very close to parabolic; the
residual second-order error is tiny. On a linear or squared axis the
mainlobe is much less parabolic, so a linear parabolic fit leaves
systematic bias of several percent of a bin.
The Gaussian window is special: its mainlobe is exactly a
parabola on a dB axis, so parabolic interpolation is exact (apart from
the sidelobe-overlap bias below). This is one reason Serra's SMS analyser
uses a Gaussian-like window.
Abe and Smith's bias
The estimator above is biased for two reasons, both discussed in detail in
SASP. First, the mainlobe on a dB axis is only approximately
parabolic; the residual third-derivative shape introduces a small bias that
depends on the offset. Second, the negative-frequency image of the same
sinusoid leaks a little energy into the positive-frequency peak — more
for short windows, less for long ones — which skews the three-point
estimate slightly towards DC.
For a Hann window of length $L$ at a tone of digital frequency $\omega_0$,
SASP gives closed-form expressions for both bias terms, each scaled by a
small constant $C$ that depends on the window and shrinking rapidly as $L$
grows. In practice the total bias is below $10^{-3}$ bins for $L \ge 1024$
and any audio frequency. Parabolic interpolation is overwhelmingly the
right tool.
Interactive: click any peak to see the estimate
Blue stems: DFT magnitude in dB. Orange curve: parabola fitted through three highest bins. Dashed vertical: true frequency. Dotted vertical: raw-bin estimate. Solid orange: parabolic estimate.
Move the frequency slider slowly and watch the raw-bin estimate jump by a
whole bin at a time while the parabolic estimate tracks continuously. This
is why every sinusoidal analyser from McAulay–Quatieri onward uses
quadratic interpolation rather than arg-max alone.
Phase interpolation
Amplitude and frequency are not the only parameters you can read from a
mainlobe; the phase at the true peak is recoverable too. With the same
three-bin fit, the interpolated phase is obtained by linearly interpolating
the phases of the two bins that straddle the interpolated frequency —
effectively
$$\hat\phi \;=\; \angle X[k^\star] \;+\; \delta\,\bigl(\angle X[k^\star{+}1] - \angle X[k^\star]\bigr)$$
for $\delta > 0$ (with bin $k^\star{-}1$ when $\delta < 0$), with the usual
care about phase unwrapping. This is the third parameter
$(A,\,\omega,\,\phi)$ that sinusoidal modelling needs per peak per frame.
04 · Sinusoidal Modelling — McAulay & Quatieri
In 1986 Robert J. McAulay and Thomas F. Quatieri, working at MIT Lincoln
Laboratory on a military speech coder, published the paper that started a
branch of audio DSP. Speech Analysis/Synthesis Based on a Sinusoidal
Representation (IEEE ASSP 34(4), 744–754) proposes that any
voiced speech signal can be well approximated by a short, rapidly
time-varying sum of sinusoids:
$$s(n) \;\approx\; \sum_{p=1}^{P} A_p(n)\,\cos\bigl(\phi_p(n)\bigr).$$
The parameters $\{A_p(n), \phi_p(n)\}$ evolve smoothly in time along
tracks. A track is born at some frame, lives for a while, and
dies when it can no longer be matched. The analysis stage turns a signal
into a list of tracks; the synthesis stage turns a list of tracks back
into a signal.
The analysis pipeline
Take a windowed DFT at frame $m$ (hop $H$).
Pick the local maxima of the log-magnitude spectrum.
For each peak, use quadratic interpolation to estimate $(A_p, \omega_p, \phi_p)$.
Match peaks in frame $m+1$ to tracks alive at frame $m$.
Unmatched tracks die; unmatched peaks give birth to new tracks.
The matching rule
McAulay and Quatieri match greedily by frequency proximity, with a threshold
$\Delta$ in Hz. Given the frequencies $\{\omega_i^{(m)}\}$ at frame $m$ and
$\{\omega_j^{(m+1)}\}$ at frame $m+1$, a track $i$ is continued by the
frame-$m{+}1$ peak $j$ that minimises $|\omega_i - \omega_j|$ subject to
$|\omega_i - \omega_j| < \Delta$. Peaks that find no free partner within
$\Delta$ are birthed as new tracks; tracks with no match within $\Delta$ are killed.
The threshold $\Delta$ is usually a few Hz per ms of hop; the exact number is
less important than the fact that one exists. The original MQ paper uses
about $\Delta \approx 0.5$ semitones, i.e. about 3–5% of the
frequency value.
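The greedy rule is short enough to sketch directly. This pared-down Python illustration tracks frequencies only — a real MQ tracker also carries amplitude and phase per track — and the peak lists are hypothetical values:

```python
def match_tracks(prev, cur, delta):
    """Greedy MQ-style matching (sketch): frequencies only.
    prev, cur: peak frequencies (Hz) in two consecutive frames.
    Returns (pairs, deaths, births) as index lists into prev/cur."""
    pairs, used = [], set()
    for i, f in sorted(enumerate(prev), key=lambda t: t[1]):
        cands = [(abs(f - g), j) for j, g in enumerate(cur)
                 if j not in used and abs(f - g) < delta]
        if cands:                                   # continue track i with the
            _, j = min(cands)                       # nearest free peak inside delta
            pairs.append((i, j))
            used.add(j)
    matched = {i for i, _ in pairs}
    deaths = [i for i in range(len(prev)) if i not in matched]   # tracks that die
    births = [j for j in range(len(cur)) if j not in used]       # new tracks
    return pairs, deaths, births

# two frames of a drifting harmonic stack plus one spurious high peak
pairs, deaths, births = match_tracks([100.0, 200.0, 300.0],
                                     [102.0, 205.0, 450.0], delta=10.0)
```

Here the 300 Hz track dies (no peak within 10 Hz) and the spurious 450 Hz peak is born as a new track — exactly the birth/death behaviour the demo below visualises.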
The synthesis stage — phase-continuous oscillator interpolation
Between adjacent frames a track has a known frequency, amplitude, and phase
at each end. The synthesiser must produce samples that match those two
end-point conditions and keep the phase continuous, i.e. the derivative
of phase should be smooth across the frame boundary. McAulay and Quatieri
chose a cubic phase polynomial: amplitude is linearly interpolated
between frames, phase is a cubic whose value and derivative at the two ends
match $\phi_p$ and $\omega_p$ at both ends:
$$\tilde\phi(t) \;=\; \phi_m + \omega_m\,t + \alpha t^2 + \beta t^3, \qquad 0 \le t \le H,$$
with $(\alpha,\beta)$ chosen uniquely so that $\tilde\phi(H) \equiv \phi_{m+1} \pmod{2\pi}$
and $\tilde\phi'(H) = \omega_{m+1}$. The phase-unwrap step that resolves the
integer ambiguity in $\phi_{m+1}$ is the MQ analogue of the phase-vocoder
princarg update; SASP gives the closed form.
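A sketch of that closed form, assuming the standard "maximally smooth" choice of the $2\pi$ multiple $M$ from the MQ paper; the frame values below are hypothetical:

```python
import math

def cubic_phase(phi0, w0, phi1, w1, H):
    """Cubic phase interpolation across one hop of H samples (sketch):
    phase(t) = phi0 + w0*t + a*t^2 + b*t^3 matches phase and frequency at
    both frame boundaries, resolving the 2*pi ambiguity with the
    maximally-smooth multiple M."""
    M = round(((phi0 + w0 * H - phi1) + (w1 - w0) * H / 2) / (2 * math.pi))
    d = phi1 - phi0 - w0 * H + 2 * math.pi * M     # unwrapped phase to travel
    a = 3 * d / H**2 - (w1 - w0) / H
    b = -2 * d / H**3 + (w1 - w0) / H**2
    return a, b

# hypothetical track values at two adjacent frames (phases rad, freqs rad/sample)
phi0, w0, phi1, w1, H = 0.3, 0.12, 2.0, 0.125, 128
a, b = cubic_phase(phi0, w0, phi1, w1, H)
phiH = phi0 + w0 * H + a * H**2 + b * H**3         # phase at the frame end
dphiH = w0 + 2 * a * H + 3 * b * H**2              # frequency at the frame end
```

Substituting back shows the two end-point conditions hold exactly: the cubic lands on $\phi_{m+1}$ (mod $2\pi$) with slope $\omega_{m+1}$.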
Interactive: MQ track visualiser
The flagship demo. Pick a signal and watch the peak-picking and the track
continuation. Each track is a coloured polyline in the time-frequency plane;
the line breaks where the track dies. Toggle the match threshold and see
tracks merge, fragment, or disappear entirely. Switch to the audio panel
and resynthesise from the tracks — with cubic-phase interpolation the
resynthesis is indistinguishable from the original for stationary voiced
signals; with piecewise-linear phase you hear characteristic artefacts.
Top: spectrogram. Bottom: MQ tracks in the time-frequency plane — each colour is one track. Change Δ to see birth/death behaviour shift.
Birth/death in MQ tracking is the visual echo of the hero canvas at the top
of this page. Each bright dot up there is a peak in one frame; the trail it
leaves is a track.
05 · Spectrum Modelling with Residuals — SMS
Pure sinusoidal modelling works beautifully for stationary pitched sounds, but it
has nothing to say about breath, bow noise, hammer thuds, cymbal hash, or the
hiss of a held $/s/$. In 1990 Xavier Serra — then Smith's PhD student at
CCRMA — published Spectral Modeling Synthesis: A Sound
Analysis/Synthesis System Based on a Deterministic plus Stochastic
Decomposition, extending MQ by explicitly separating signals into a
sinusoidal deterministic part and a noise-like stochastic
part.
The deterministic part is exactly McAulay–Quatieri: track-based
sinusoidal resynthesis. The residual $e(n) = s(n) - \hat s(n)$ in the time
domain is then analysed separately. Serra observes that for musical signals
$e(n)$ is very well modelled as filtered noise: a coloured random process
whose short-time spectral envelope can be sampled at the same frame rate as
the sinusoids. Synthesis becomes deterministic-cosine-bank plus
filtered-white-noise convolution.
The residual envelope can be estimated by smoothing $|E(m,k)|$ along $k$
with a moving average, or equivalently by fitting a low-order cepstral or
LPC envelope to the residual's log-magnitude. SMS uses line-segment
approximations; more recent systems use LPC or the true envelope from the
next chapter.
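The simplest of those estimators — the moving average along $k$ — is a one-liner. A Python sketch on a toy residual frame (the dB values are invented for illustration):

```python
def envelope_ma(mag_db, width):
    """Residual-envelope sketch: moving average of a log-magnitude spectrum
    along the frequency axis; width is the (odd) number of bins averaged."""
    K, half = len(mag_db), width // 2
    return [sum(mag_db[max(0, k - half):min(K, k + half + 1)])
            / (min(K, k + half + 1) - max(0, k - half))   # shorter at the edges
            for k in range(K)]

# a toy residual frame: harmonic ripple around a flat envelope
env = envelope_ma([0.0, 10.0, 0.0, 10.0, 0.0], width=3)
```

The ripple is flattened toward its local mean — exactly the smooth curve the stochastic synthesiser needs to colour its white noise.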
Interactive: deterministic vs stochastic
Top: original spectrogram. Middle: deterministic-only spectrogram (bank-of-sinusoids). Bottom: residual spectrogram.
The ratio of sinusoidal energy to residual energy is a powerful diagnostic
for musical timbre. A pure flute sits near 100% deterministic; a cymbal sits
near 0%; a voiced vowel with breath sits at roughly 70–80%. SMS was
one of the first analysis systems that could measure this ratio
continuously in time, and it influenced every later computational
auditory scene analysis (CASA) system.
06 · Spectral Envelope Estimation
A harmonic sound has a fine structure (the comb of harmonics at
multiples of $f_0$) and an envelope (the smooth curve that the
harmonic tips sit on). The envelope encodes the timbre — the vocal
tract filter, the body of an instrument, the resonance of a room. Three
classical methods for estimating it are covered in SASP.
Cepstral liftering
The cepstrum is the IDFT of the log-magnitude spectrum. Fine structure has
rapid variation across frequency — high-quefrency cepstrum —
while the envelope has slow variation — low-quefrency cepstrum.
"Liftering" (filtering in cepstrum domain) keeps only the lowest $Q$
quefrency bins and discards the rest.
The cepstral envelope is very cheap, but it tends to under-shoot formant
peaks because harmonic peaks and valleys average out equally.
True envelope (Röbel & Rodet)
Iteratively take the cepstral envelope, then replace any spectral value
below the envelope by the envelope itself; repeat. This "cepstral
discriminator" converges to the envelope that touches the harmonic peaks
from above — the true envelope. Röbel and Rodet (2005)
proved convergence; the method is now a standard feature in tools like
SuperVP, World, and phase-vocoder pitch shifters that need to preserve
formants.
Linear prediction (LPC)
A different route. Assume the signal is the output of an all-pole filter
driven by white noise (or a pulse train). Find the filter coefficients
$\{a_1,\dots,a_p\}$ that minimise the one-step prediction error
$$E \;=\; \sum_n e(n)^2, \qquad e(n) \;=\; x(n) + \sum_{i=1}^{p} a_i\,x(n-i).$$
The squared filter response $|1/A(e^{j\omega})|^2$ is then the LPC
spectrum, and its peaks are the formants. LPC is historically the workhorse
of speech coding (Atal & Schroeder at Bell Labs, Itakura & Saito at
NTT, Markel & Gray), and it is still used inside most modern vocoders.
The Levinson–Durbin recursion that solves for the $a_i$ in
$\mathcal{O}(p^2)$ appears in chapter 10.
Interactive: LPC order slider on a vowel
Blue: DFT magnitude (dB). Orange: LPC envelope. Green: cepstral envelope. Red ticks: formant peaks detected on the LPC envelope.
07 · Pitch Estimation
Pitch is one of the oldest problems in audio DSP. SASP covers the three
main families: autocorrelation in the time domain, cepstrum in the
quefrency domain, and the modern difference-function family (YIN). Each
family decides which samples of the signal look most like copies of each
other at lag $T_0$.
Autocorrelation
The autocorrelation $r(\tau) = \sum_n x(n)\,x(n+\tau)$ of a periodic signal
with period $T_0$ has a peak at $\tau = T_0$. Find the highest peak away
from $\tau = 0$. Cheap and effective for clean pitched signals; prone to
octave errors when the fundamental is weak or a partial is strong.
Cepstrum
The cepstrum of a voiced sound has a spike at quefrency $T_0$ — the
period of the harmonic comb in the log-magnitude spectrum. Noll proposed it
in 1967 and it remained the standard speech pitch tracker into the 1990s.
Picks the right octave more reliably than autocorrelation.
YIN — de Cheveigné & Kawahara (2002)
The pitch tracker inside most modern audio software. YIN defines the
difference function
$$d(\tau) \;=\; \sum_{n=1}^{W} \bigl(x(n) - x(n+\tau)\bigr)^2,$$
which is zero at $\tau = T_0$ for a perfect periodic signal. For real
signals it has a deep local minimum there. The second YIN insight is the
cumulative mean normalised difference function
$$d'(\tau) \;=\; \begin{cases} 1, & \tau = 0,\\[2pt] d(\tau)\Big/\Bigl[\tfrac{1}{\tau}\sum_{j=1}^{\tau} d(j)\Bigr], & \tau > 0, \end{cases}$$
which starts at 1 and dips well below 1 at true periods. Pick the smallest
$\tau$ for which $d'(\tau)$ falls below an absolute threshold (typically
$0.1$), and parabolically interpolate around it. YIN's robustness against
noise and octave errors made it the de-facto standard, and it underlies the
pitch trackers in Ardour, Ableton Live's Melodyne-like tools, Audacity's
pitch-detect plugin, and the librosa.pyin default.
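The core of YIN is small enough to sketch whole. A Python illustration, omitting the parabolic refinement step and using a clean synthetic tone; `tau_max` and the 0.1 threshold are the usual assumptions:

```python
import math

def yin_period(x, tau_max, thresh=0.1):
    """YIN sketch: difference function d, cumulative-mean-normalised d',
    absolute threshold, then walk down into the dip's local minimum.
    Returns the period in whole samples (no parabolic refinement here)."""
    W = len(x) - tau_max                            # comparison window length
    d = [0.0] + [sum((x[n] - x[n + tau]) ** 2 for n in range(W))
                 for tau in range(1, tau_max + 1)]
    dprime, run = [1.0], 0.0
    for tau in range(1, tau_max + 1):
        run += d[tau]
        dprime.append(d[tau] * tau / run if run > 0 else 1.0)
    for tau in range(1, tau_max + 1):
        if dprime[tau] < thresh:                    # first dip under threshold
            while tau + 1 <= tau_max and dprime[tau + 1] < dprime[tau]:
                tau += 1                            # slide to the dip's bottom
            return tau
    return min(range(1, tau_max + 1), key=dprime.__getitem__)

fs, f0 = 8000.0, 200.0                              # 200 Hz -> period of 40 samples
x = [math.sin(2 * math.pi * f0 * n / fs) for n in range(400)]
period = yin_period(x, tau_max=100)
```

Note that the normalisation forces $d'(1) = 1$, so spuriously small lags can never win — that is the mechanism behind YIN's octave robustness.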
Interactive: YIN on a pitched signal with noise
Top: raw difference function d(τ). Bottom: cumulative normalised d'(τ) with threshold line. Orange vertical: detected period.
08 · Time–Frequency Reassignment
A spectrogram cell is wide in time and wide in frequency — the
Heisenberg–Gabor box. The energy that falls inside that cell is
not uniformly distributed across the cell; for a single sinusoid,
all of it is concentrated at a single point. Reassignment is a standard
post-processing trick that moves each spectrogram pixel to the centre of
gravity of its energy, producing a far sharper picture without changing the
underlying FFTs.
Kodera, Gendrin, de Villedary (1976)
Kodera and colleagues proposed the principle in a 1976 paper on geophysical
signal analysis: move each spectrogram value from its grid cell
$(t_m, \omega_k)$ to the reassigned position
$$\hat t \;=\; -\frac{\partial \phi}{\partial \omega}, \qquad \hat\omega \;=\; \frac{\partial \phi}{\partial t},$$
where $\phi(t,\omega)$ is the STFT phase — the local centre of gravity of
the energy. Patrick Flandrin and his student François Auger gave the
closed-form discrete version, which is cheap to compute. Given the STFT
$X_w$ with window $w$, you also compute two "helper" STFTs: $X_{tw}$ with
window $t\,w(t)$ (time ramp), and $X_{w'}$ with window $w'(t)$ (window
derivative). The reassignment coordinates are
$$\hat t(t,\omega) \;=\; t + \operatorname{Re}\,\frac{X_{tw}(t,\omega)}{X_w(t,\omega)}, \qquad \hat\omega(t,\omega) \;=\; \omega - \operatorname{Im}\,\frac{X_{w'}(t,\omega)}{X_w(t,\omega)}.$$
For a single chirp, reassignment turns the smeared spectrogram into an
almost pixel-perfect line, and for a bell it turns the smeared striae into
tight clusters at the partials. The cost is a loss of interpretability for
mixed regions — where a sinusoid and an impulse overlap, reassigned
pixels can go in both directions.
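A single-tone check of the derivative-window frequency correction makes the idea concrete. This Python sketch assumes the STFT convention $X(\omega) = \sum_n x[n]\,w[n]\,e^{-j\omega n}$ (sign conventions vary between texts) and uses the analytic derivative of the Hann window:

```python
import cmath, math

def reassigned_freq(x, w, dw, k):
    """Reassigned frequency (rad/sample) at bin k from two STFT frames:
    one with the window w, one with its derivative dw."""
    N = len(x)
    e = [cmath.exp(-2j * math.pi * k * n / N) for n in range(N)]
    Xw = sum(x[n] * w[n] * e[n] for n in range(N))
    Xdw = sum(x[n] * dw[n] * e[n] for n in range(N))
    return 2 * math.pi * k / N - (Xdw / Xw).imag    # bin centre + correction

L = 256
w = [0.5 - 0.5 * math.cos(2 * math.pi * n / L) for n in range(L)]       # Hann
dw = [(math.pi / L) * math.sin(2 * math.pi * n / L) for n in range(L)]  # w'
w0 = 2 * math.pi * 40.37 / L                        # tone 0.37 bins off-grid
x = [cmath.exp(1j * w0 * n) for n in range(L)]
west = reassigned_freq(x, w, dw, k=40)              # pulled onto the true frequency
```

The bin centre sits 0.37 bins away from the tone, yet the corrected value lands almost exactly on $\omega_0$ — this is the "smeared mainlobe collapses to a line" effect in one number.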
Interactive: reassigned spectrogram on chirp + bell
Top: ordinary spectrogram. Bottom: reassigned spectrogram computed with the Auger-Flandrin formulas using three STFTs.
09 · The Chirp-Z Transform
The DFT samples $H(z)$ on the unit circle at $N$ equally spaced angles. The
chirp-z transform (Rader, Schafer & Rabiner 1969, building on Rader
1968) samples $H(z)$ on an arbitrary spiral starting at any point
and stepping by any complex ratio. This lets you zoom into a narrow
frequency band of an FIR filter, or evaluate the filter's response on a
near-unit-circle contour to inspect pole damping — things the plain
FFT cannot do without enormous zero-padding.
The transform evaluates
$$X_k \;=\; \sum_{n=0}^{N-1} x(n)\,A^{-n}\,W^{-nk}, \qquad k = 0,\dots,M-1,$$
where $A = A_0 e^{j\theta_0}$ is the start point and $W = W_0 e^{j\phi_0}$
is the step. For $A_0 = 1$, $\theta_0 = 0$, $W_0 = 1$, $\phi_0 = 2\pi/N$
you recover the DFT.
Why it matters
Bluestein (1970) showed that $X_k$ above can be computed by three FFTs of
length $\ge N + M - 1$. That makes the chirp-z transform practical, and it
is how modern software evaluates a single narrow band of a spectrum at
arbitrary resolution — think of the high-resolution zooming on a DAW
spectrum analyser around a harmonic of a kick drum.
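The definition itself is a two-line sum. A direct Python sketch — deliberately the slow $O(NM)$ form rather than Bluestein's trick, to make the contour explicit — confirming that the default parameters reduce to the plain DFT:

```python
import cmath, math

def czt(x, M, A, W):
    """Direct O(N*M) chirp-z transform: X_k = sum_n x[n] A^{-n} W^{-n*k},
    i.e. the z-transform sampled on the spiral z_k = A * W^k.
    (Bluestein's three-FFT trick computes the same values much faster.)"""
    N = len(x)
    return [sum(x[n] * (A ** -n) * (W ** (-n * k)) for n in range(N))
            for k in range(M)]

# with A = 1 and W = exp(j*2*pi/N) the CZT reduces to the plain DFT
X = czt([1.0, 2.0, 3.0, 4.0], M=4, A=1.0, W=cmath.exp(2j * math.pi / 4))
```

Zooming is then just a different `A` and `W`: put `A` at the bottom of the band of interest and make the angular step $\phi_0$ a small fraction of $2\pi/N$.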
Interactive: pick a spiral, see the zoomed spectrum
Left: the z-plane with unit circle and the chosen spiral contour. Right: full DFT of the signal in dB (blue) with the CZT-zoomed slice (orange) plotted on top. Move the span slider to the left to see the chirp-z's fine resolution over the narrow band.
10 · Linear Prediction — Levinson and the Lossless Tube
Linear prediction is such a central tool in audio DSP that SASP devotes a
full chapter to it separately from the envelope-estimation use in chapter 6.
Three ideas interlock: the Yule–Walker normal equations, the
$\mathcal{O}(p^2)$ Levinson–Durbin recursion, and the lossless
acoustic-tube interpretation of the reflection coefficients.
The normal equations
Minimising $E = \sum e(n)^2$ by setting $\partial E/\partial a_i = 0$ gives
the Toeplitz system
$$\sum_{j=1}^{p} r(|i-j|)\,a_j \;=\; -\,r(i), \qquad i = 1,\dots,p,$$
where $r(i)$ is the autocorrelation of the frame. Levinson–Durbin solves it
order by order: at order $i$ the reflection coefficient is
$$k_i \;=\; -\,\frac{r(i) + \sum_{j=1}^{i-1} a_j^{(i-1)}\,r(i-j)}{E_{i-1}},$$
the new predictor coefficients are $a_j^{(i)} = a_j^{(i-1)} + k_i\,a_{i-j}^{(i-1)}$
(with $a_i^{(i)} = k_i$),
and the prediction error power updates as
$$E_i \;=\; E_{i-1}\,(1 - k_i^{\,2}).$$
Stability: if all $|k_i| < 1$ then $A(z) = 1 + \sum a_i z^{-i}$ has all
zeros inside the unit circle, so $1/A(z)$ is stable. The algorithm is
Norman Levinson's recursion (1947), rediscovered in the signal
processing context by J. Durbin (1960) and J. P. Burg, and put
to work in speech by B. S. Atal and L. R. Rabiner.
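The recursion translates directly into code. A Python sketch using a synthetic AR(1) autocorrelation $r(i) = 0.9^i$, for which the answer is known in closed form:

```python
def levinson_durbin(r, p):
    """Levinson-Durbin sketch: solve the Toeplitz normal equations for
    A(z) = 1 + a1 z^-1 + ... + ap z^-p from autocorrelations r[0..p].
    Returns (a[1..p], reflection coeffs k_i, error powers E_i)."""
    a, E = [1.0] + [0.0] * p, r[0]
    ks, Es = [], [E]
    for i in range(1, p + 1):
        k = -(r[i] + sum(a[j] * r[i - j] for j in range(1, i))) / E
        a = [a[j] + k * a[i - j] if 1 <= j < i else a[j] for j in range(p + 1)]
        a[i] = k                                    # new order-i coefficient
        E *= 1.0 - k * k                            # error power shrinks if |k|<1
        ks.append(k)
        Es.append(E)
    return a[1:], ks, Es

# AR(1) check: r(i) = 0.9^i should yield A(z) = 1 - 0.9 z^-1 and k2 = 0
a, ks, Es = levinson_durbin([1.0, 0.9, 0.81], p=2)
```

The second reflection coefficient comes out zero — the recursion itself discovers that one pole suffices, which is exactly the "knee" in the error-power plot of the demo below.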
The lossless tube interpretation
The reflection coefficients $\{k_i\}$ have a beautiful physical meaning:
they are literally the boundary reflection coefficients of a lossless
acoustic tube made of $p+1$ uniform cylindrical sections. A positive $k_i$
means the tube widens, a negative $k_i$ means it narrows. This insight,
due to Kelly and Lochbaum in the 1960s (and developed by Markel and Gray),
is why LPC analysis and articulatory vocal-tract modelling converged:
$\{k_i\}$ from speech analysis give you a tube, and the tube reproduces the
speech. The cross-link to Physical Audio Signal Processing is
direct.
Interactive: Levinson–Durbin
Left: reflection coefficients k_i bar chart (green: |k|<1 stable). Right: prediction-error power E_i as a function of order (log scale) — a "knee" indicates that more coefficients stop helping.
11 · Phase-Vocoder Variants
The phase vocoder — Flanagan & Golden's 1966 Bell Labs idea to
treat each STFT bin as an amplitude-modulated, frequency-modulated carrier
— lives inside almost every audio time-stretcher and pitch-shifter.
The mechanics of the phase-unwrap update belong in the STFT companion,
which derives them in full; here we survey the variants SASP focuses on.
Flanagan & Golden (1966)
The original. Each analysis bin carries a magnitude and an instantaneous
frequency; each synthesis bin reconstructs a sinusoid from those. In its
plain form it suffers from "phasiness": sinusoids that straddle two bins
(which is most of them) lose their strict phase coupling when each bin's
phase is advanced independently.
Laroche & Dolson identity phase locking (1999)
Jean Laroche and Mark Dolson
fixed phasiness with a small change. Detect the local magnitude peaks in
every frame; for each peak, advance its phase by the usual phase-vocoder
update; but for the two neighbours on each side, copy the peak's phase
increment so the relative phase across the three-bin mainlobe is
preserved. The result sounds dramatically cleaner.
If you have only a magnitude spectrogram — say from a neural model
— the Griffin–Lim alternating projection is the classical way
to invent a plausible phase. It is covered in detail in the
STFT for Sound
companion. Modern neural vocoders (WaveNet, HiFi-GAN, WaveGlow) beat
Griffin–Lim on perceived quality, but it remains the dependency-free
baseline.
Sinusoidal-model phase vocoder
SASP's distinctive contribution is the observation that the plain
phase-vocoder phase-update equations are exactly what fall out of
sinusoidal modelling when you identify each bin with a single track. The
MQ synthesiser is a phase vocoder, one oscillator per track, with
the bonus that birth/death resolve explicitly. A sinusoidal-model
time-stretch stretches tracks, not bins, and does not suffer from phasiness
at all. Modern tools like SPEAR, Loris, and SMSTools implement this
approach.
12 · Applications
The methods in chapters 3–11 are the building blocks for five broad
application families in audio. Each is an active area of research and each
runs on the same few primitives: pick peaks, track them, estimate envelopes,
separate, resynthesise.
Additive resynthesis with track editing
The earliest application of MQ/SMS analysis. Edit the tracks — move
them in pitch, stretch them in time, remove some — and resynthesise.
SPEAR, Loris, and SMSTools all implement this. The interactive SMS demo in
chapter 5 is a miniature version: the "deterministic" button is exactly
additive resynthesis from the retained peaks.
Pitch-shifting that preserves timbre
Plain resampling shifts pitch and formants together, producing the
"chipmunk" effect. A sinusoidal-model pitch shift scales the frequencies
but keeps the amplitudes driven by the spectral envelope of the original
— so when the pitch goes up, the formants stay put. This is how
Melodyne, Auto-Tune, Ableton's Complex Pro, and Logic's Flex Pitch all
work under the hood. The envelope is usually a cepstral or true envelope,
sampled at the new partial frequencies.
Sinusoidal-model denoising
If a signal is known to be mostly a small number of sinusoids — a
solo violin, a sustained vowel — you can denoise aggressively by
resynthesising only the tracks that live long enough and are loud enough.
Noise shows up as short, faint peaks that never link into tracks; they
are discarded by the track-length and amplitude thresholds. This approach
is the ancestor of modern non-negative matrix factorisation denoisers.
Adaptive spectral whitening
Invert the estimated spectral envelope (LPC, cepstral, or true) to get a
whitening filter, apply it, and you have a flattened residual that is
much easier to analyse for onsets, pitch, or chord detection. Adaptive
whitening is baked into almost every modern music-information-retrieval
pipeline.
Differentiable spectral modelling (DDSP)
Engel, Hantrakul, Gu, and Roberts (2020) showed that sinusoidal plus
filtered-noise decomposition can be made differentiable and
trained end-to-end with a neural controller, producing the
Differentiable Digital Signal Processing framework. DDSP is the
direct neural descendant of SMS: sinusoids plus stochastic residual, with
the parameters emitted by a neural net rather than hand-tuned. See the
DDSP_Differentiable_DSP
companion for an interactive treatment.
13 · Legacy and Continuing Impact
Smith's SASP is the Rosetta Stone between classical spectrum analysis and
today's neural-audio world. Its vocabulary — tracks, deterministic
plus stochastic, reassignment, cepstral envelope, reflection coefficients
— is the vocabulary of every modern spectral audio system, even the
ones whose parameters are learned instead of estimated.
Software descendants
SPEAR (Klingbeil, 2005): sinusoidal-track editor with
MQ analysis and additive resynthesis; educational and still widely
used.
SMSTools (Serra and the UPF Music Technology Group):
the canonical SMS implementation; teaches the sinusoids-plus-residual
decomposition.
Loris (Fitz & Haken, 2002): partials tracking
with bandwidth-enhanced partials — SMS with a noise bandwidth
per partial instead of a separate residual.
World (Morise, 2016): fast speech analysis with
DIO pitch, CheapTrick envelope, and PLATINUM aperiodicity — all
built on SASP primitives.
SuperVP (IRCAM): the engine behind AudioSculpt, with
true-envelope, phase-vocoder time-stretch, and track editing.
Coding standards
MPEG-4 HILN (Harmonic and Individual Lines plus Noise,
Purnhagen & Meine 2000) codes low-bitrate audio with an explicit
sinusoids-plus-noise model — SMS in a standards document. Even where
MDCT has won the codec wars (AAC, Opus), residual noise shaping borrows the
SMS stochastic-part idea.
Neural spectral vocoders
WaveNet (van den Oord 2016), Parallel WaveNet (2017), WaveRNN (2018),
MelGAN (2019), HiFi-GAN (2020), BigVGAN (2022) all map log-mel
spectrograms to waveforms with neural networks. None of them explicitly
uses MQ tracks or LPC, but the input representation they consume
— log-mel spectrogram — is exactly what SASP's chapter on the
STFT gives you. Their output quality rises in step with the fidelity of
their handling of the sinusoidal part of the spectrum, which is
still the hardest thing for a neural net to reproduce cleanly.
Differentiable spectral modelling
DDSP (Engel et al. 2020) closed the circle: sinusoidal-plus-noise
synthesis implemented with differentiable operations, parameters emitted
by a neural controller, trained end-to-end. The result is orders of
magnitude more parameter-efficient than raw neural vocoding and inherits
the interpretability of SMS. As of 2025, DDSP-style hybrid models are
behind several commercial synths and are the core of the Magenta project.
McAulay, R. J. & Quatieri, T. F. (1986). Speech Analysis/Synthesis Based on a Sinusoidal Representation. IEEE Trans. ASSP, 34(4), 744–754.
Serra, X. & Smith, J. O. (1990). Spectral Modeling Synthesis: A Sound Analysis/Synthesis System Based on a Deterministic plus Stochastic Decomposition. Computer Music Journal, 14(4), 12–24.
Abe, M. & Smith, J. O. (2004). Design Criteria for the Quadratically Interpolated FFT Method (I): Bias due to Interpolation. CCRMA Technical Report STAN-M-117.
de Cheveigné, A. & Kawahara, H. (2002). YIN, a Fundamental Frequency Estimator for Speech and Music. J. Acoust. Soc. Am., 111(4), 1917–1930.
Noll, A. M. (1967). Cepstrum Pitch Determination. J. Acoust. Soc. Am., 41(2), 293–309.
Kodera, K., Gendrin, R. & de Villedary, C. (1976). A New Method for the Numerical Analysis of Nonstationary Signals. Phys. Earth Planet. Int., 12, 142–150.
Auger, F. & Flandrin, P. (1995). Improving the Readability of Time-Frequency and Time-Scale Representations by the Reassignment Method. IEEE Trans. Signal Processing, 43(5), 1068–1089.
Rabiner, L. R., Schafer, R. W. & Rader, C. M. (1969). The Chirp z-Transform Algorithm. IEEE Trans. Audio Electroacoust., 17(2), 86–92.
Bluestein, L. I. (1970). A Linear Filtering Approach to the Computation of the Discrete Fourier Transform. IEEE Trans. Audio Electroacoust., 18(4), 451–455.
Atal, B. S. & Schroeder, M. R. (1979). Predictive Coding of Speech Signals and Subjective Error Criteria. IEEE Trans. ASSP, 27(3), 247–254.
Itakura, F. & Saito, S. (1970). A Statistical Method for Estimation of Speech Spectral Density and Formant Frequencies. Electronics and Communications in Japan, 53-A, 36–43.
Markel, J. D. & Gray, A. H. (1976). Linear Prediction of Speech. Springer-Verlag.
Levinson, N. (1947). The Wiener RMS Error Criterion in Filter Design and Prediction. J. Math. Phys., 25, 261–278.
Durbin, J. (1960). The Fitting of Time-Series Models. Rev. Int. Statist. Inst., 28, 233–244.
Flanagan, J. L. & Golden, R. M. (1966). Phase Vocoder. Bell System Technical Journal, 45(9), 1493–1509.
Laroche, J. & Dolson, M. (1999). Improved Phase Vocoder Time-Scale Modification of Audio. IEEE Trans. SAP, 7(3), 323–332.
Griffin, D. W. & Lim, J. S. (1984). Signal Estimation from Modified Short-Time Fourier Transform. IEEE Trans. ASSP, 32(2), 236–243.
Röbel, A. & Rodet, X. (2005). Efficient Spectral Envelope Estimation and its Application to Pitch Shifting and Envelope Preservation. Proc. DAFx, 30–35.
Purnhagen, H. & Meine, N. (2000). HILN—The MPEG-4 Parametric Audio Coding Tools. Proc. ISCAS.
Engel, J. et al. (2020). DDSP: Differentiable Digital Signal Processing. Proc. ICLR 2020.
Fitz, K. & Haken, L. (2002). On the Use of Time–Frequency Reassignment in Additive Sound Modeling. J. Audio Eng. Soc., 50(11), 879–893.
Kong, J. et al. (2020). HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis. NeurIPS 2020.
van den Oord, A. et al. (2016). WaveNet: A Generative Model for Raw Audio. arXiv:1609.03499.