The Constant-Q Transform

A visual, interactive guide to the frequency transform that hears music the way we do — with logarithmic resolution matching the structure of pitch.

Chapter 1

01 The Core Intuition

"Constant Q" means the ratio of each frequency bin's centre frequency to its bandwidth is constant. Low notes get wide analysis windows (good frequency resolution); high notes get narrow windows (good time resolution).

This mirrors how musical pitch works. On a piano, the distance from C3 to C4 spans the same perceptual interval as C5 to C6, even though the latter covers four times the Hz range. The CQT spaces its bins logarithmically to match, typically placing 12, 24, 36, or more bins per octave.

CQT frequency bins, shown for 24 bins per octave over 5 octaves. Each bar is one bin. Notice how bandwidth grows with frequency, but Q = f/Δf stays constant.

Q factor: Q = f_k / Δf_k. In a standard FFT every bin has the same bandwidth Δf = f_s/N, so Q varies: low-frequency bins have low Q (each bin is wide relative to its centre frequency) and high-frequency bins have high Q. In the CQT, Q is the same for every bin. This is why musical notes look equally spaced in a CQT spectrogram.
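To make the contrast concrete, here is a minimal sketch that evaluates Q for a few frequencies under both schemes. It assumes the CQT bandwidth is the gap to the next bin, which gives Q = 1 / (2^{1/B} − 1) as stated with the kernel definition in Chapter 4. The sample rate and FFT size are illustrative values, not taken from this guide.

```python
def cqt_q(bins_per_octave):
    """Q = f_k / bandwidth for a CQT whose bandwidth is the gap to the next bin."""
    return 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)


def fft_q(freq_hz, sample_rate, n_fft):
    """Q of an FFT bin centred at freq_hz: every bin is sample_rate / n_fft wide."""
    return freq_hz / (sample_rate / n_fft)


if __name__ == "__main__":
    sample_rate, n_fft = 44100, 4096  # assumed values for illustration
    for f in (110.0, 440.0, 1760.0):
        print(f"{f:7.1f} Hz  FFT Q = {fft_q(f, sample_rate, n_fft):7.2f}  "
              f"CQT Q (B=12) = {cqt_q(12):.2f}")
```

The FFT column grows in proportion to frequency, while the CQT column prints the same value on every row.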

Chapter 2

02 FFT vs CQT

The FFT uses linearly-spaced frequency bins. Musical notes are logarithmically spaced. The mismatch makes pitch analysis with the FFT awkward — the CQT solves this directly.

Consider a signal containing notes at C3, E4, and G5. In the FFT, these three notes are unevenly distributed across the spectrum, and the low notes are smeared across very few bins. In the CQT, each note occupies proportionally the same width.

Three-note chord: C3 (130.8 Hz) + E4 (329.6 Hz) + G5 (784 Hz)

FFT — linear frequency axis. Low notes are crammed into a tiny region.

CQT — log frequency axis. Each note gets equal visual weight.

The fundamental difference: FFT bin spacing is Δf = f_s/N (constant in Hz). CQT bin spacing is Δf_k = f_k · (2^{1/B} − 1) (proportional to frequency). This makes the CQT a natural fit for anything involving musical pitch, speech formants, or logarithmic frequency structure.
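As a rough illustration of the two spacings, the sketch below maps the three notes of the chord to their nearest FFT bin and their nearest CQT bin. The FFT settings, the choice of C2 as f_min, B = 12, and the helper names are assumptions made for this example.

```python
import math


def midi_to_hz(midi_note):
    """Equal-tempered frequency with A4 (MIDI 69) tuned to 440 Hz."""
    return 440.0 * 2.0 ** ((midi_note - 69) / 12.0)


def fft_bin(freq_hz, sample_rate, n_fft):
    """Nearest linearly spaced FFT bin: spacing is sample_rate / n_fft."""
    return round(freq_hz * n_fft / sample_rate)


def cqt_bin(freq_hz, f_min, bins_per_octave):
    """Nearest log-spaced CQT bin: k = B * log2(f / f_min)."""
    return round(bins_per_octave * math.log2(freq_hz / f_min))


if __name__ == "__main__":
    sample_rate, n_fft = 44100, 4096             # assumed FFT settings
    f_min, bins_per_octave = midi_to_hz(36), 12  # assumed: start at C2, one bin per semitone
    for name, midi in (("C3", 48), ("E4", 64), ("G5", 79)):
        f = midi_to_hz(midi)
        print(name, round(f, 1), "FFT bin", fft_bin(f, sample_rate, n_fft),
              "CQT bin", cqt_bin(f, f_min, bins_per_octave))
```

With these settings the FFT bin width is about 10.8 Hz, so the semitone above C3 (about 7.8 Hz away) is less than one bin away, while the semitone above G5 is several bins away. In the CQT with B = 12 both steps are exactly one bin.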

Chapter 3

03 Log-Frequency Geometry

The CQT maps frequency to a logarithmic axis where octaves are equally spaced. This aligns perfectly with the Western chromatic scale — and with human pitch perception.

Animated piano-roll CQT spectrogram — watch notes light up on a log-frequency axis as the arpeggio plays

Bins per octave (B): With B=12, each bin is one semitone. B=24 gives quarter-tone resolution. B=36 gives sixth-tone resolution (three bins per semitone). Higher B means finer pitch discrimination but longer analysis windows at low frequencies.

f_k = f_min · 2^{k/B} for k = 0, 1, …, K−1 where K = B · num_octaves
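The centre-frequency formula translates almost directly into code. The following is a small sketch; the function name and the example values (f_min = 55 Hz, B = 12, 5 octaves) are assumptions chosen for illustration.

```python
def cqt_frequencies(f_min, bins_per_octave, num_octaves):
    """Centre frequencies f_k = f_min * 2**(k / B) for k = 0 .. K-1."""
    num_bins = bins_per_octave * num_octaves  # K = B * num_octaves
    return [f_min * 2.0 ** (k / bins_per_octave) for k in range(num_bins)]


if __name__ == "__main__":
    # Assumed values for illustration: f_min = 55 Hz (A1), B = 12, 5 octaves.
    freqs = cqt_frequencies(55.0, 12, 5)
    print(len(freqs), "bins")
    print([round(f, 2) for f in freqs[:13]])  # first octave plus the bin one octave up
```

Every step of B bins doubles the frequency, and the ratio between neighbouring bins is always 2^{1/B}.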

Chapter 4

04 The CQT Kernel

Each CQT bin uses a different-length analysis window. Low-frequency bins use long windows for precise pitch resolution; high-frequency bins use short windows for precise timing.

The window length for bin k is N_k = Q · f_s / f_k. This is the heart of the constant-Q property — the window always contains the same number of oscillation cycles, regardless of frequency.

N_k = ⌈ Q · f_s / f_k ⌉ where Q = 1 / (2^{1/B} − 1)
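A short sketch of the window-length rule, under assumed parameters (44.1 kHz sample rate, f_min = 55 Hz, B = 12, 5 octaves). It also reports how many cycles of f_k fit in each window, which should stay close to Q for every bin.

```python
import math


def window_lengths(f_min, bins_per_octave, num_octaves, sample_rate):
    """Return (Q, [N_k]) with N_k = ceil(Q * f_s / f_k)."""
    q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)
    lengths = []
    for k in range(bins_per_octave * num_octaves):
        f_k = f_min * 2.0 ** (k / bins_per_octave)
        lengths.append(math.ceil(q * sample_rate / f_k))
    return q, lengths


if __name__ == "__main__":
    sample_rate, f_min, bins_per_octave = 44100, 55.0, 12  # assumed values
    q, lengths = window_lengths(f_min, bins_per_octave, 5, sample_rate)
    for k in (0, 12, 24, 59):
        f_k = f_min * 2.0 ** (k / bins_per_octave)
        cycles = lengths[k] * f_k / sample_rate
        print(f"bin {k:2d}  f_k = {f_k:8.2f} Hz  N_k = {lengths[k]:6d}  cycles = {cycles:.2f}")
    print(f"Q = {q:.2f}")
```

The ceiling makes each window at most one sample longer than Q · f_s / f_k, so the cycle count can exceed Q only by a small fraction.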

CQT analysis kernel for the selected bin — real part (top) and magnitude envelope (bottom). Drag the slider to see how window length shrinks at higher frequencies.

1. Compute the window length N_k for each bin from Q and f_k

2. Generate a windowed complex exponential at f_k

3. Multiply the signal by this kernel and sum: that's one CQT coefficient

4. Repeat for all K bins across all time frames
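The steps above can be sketched as a direct, unoptimised loop for a single frame. This is an illustration only: the Hann window, the normalisation by N_k, and all parameter values are assumptions, and the function name is hypothetical.

```python
import cmath
import math


def naive_cqt_frame(signal, start, sample_rate, f_min, bins_per_octave, num_bins):
    """Magnitudes of one CQT frame whose windows all begin at sample index `start`."""
    q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)
    magnitudes = []
    for k in range(num_bins):
        f_k = f_min * 2.0 ** (k / bins_per_octave)
        n_k = math.ceil(q * sample_rate / f_k)            # step 1: window length
        coeff = 0j
        for j in range(n_k):
            # Step 2: windowed complex exponential (the Hann window is an assumption).
            w = 0.5 - 0.5 * math.cos(2.0 * math.pi * j / n_k)
            kernel = w * cmath.exp(-2j * math.pi * f_k * j / sample_rate)
            coeff += signal[start + j] * kernel           # step 3: multiply and sum
        magnitudes.append(abs(coeff) / n_k)               # assumed normalisation by N_k
    return magnitudes  # step 4: call again with a new `start` for each later frame


if __name__ == "__main__":
    sample_rate, f_min, bins_per_octave, num_bins = 8000, 110.0, 12, 36  # assumed
    f_tone = f_min * 2.0 ** (19 / bins_per_octave)        # a tone centred on bin 19
    signal = [math.sin(2.0 * math.pi * f_tone * n / sample_rate) for n in range(2000)]
    mags = naive_cqt_frame(signal, 0, sample_rate, f_min, bins_per_octave, num_bins)
    print("strongest bin:", max(range(num_bins), key=lambda k: mags[k]))
```

The signal must contain at least `start` plus the longest window length (that of bin 0) samples, since the lowest bin reads the most samples.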

Chapter 5

05 Computing the CQT

The naive CQT is expensive — O(KN) per frame. Efficient algorithms use the FFT as a backbone, either via spectral kernels or recursive downsampling.

Building a CQT spectrogram frame by frame — each column is one hop, each row is one log-frequency bin

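Efficient CQT (Brown & Puckette, 1992): Pre-compute the spectral kernels K̂_k by taking the FFT of each time-domain kernel. For each frame, take the FFT X̂ of the signal segment; by Parseval's theorem the CQT coefficient for bin k is then the inner product Σ_m X̂(m) · K̂_k*(m), scaled by 1/N for an N-point FFT. Because each K̂_k is close to zero outside a narrow band around f_k, small values can be discarded, which converts the whole transform into sparse matrix–vector multiplications, often 10–100× faster than the naive approach.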
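The identity behind the spectral-kernel approach can be checked numerically on a toy example. The sketch below is not the Brown and Puckette implementation; it only verifies that the time-domain sum and the frequency-domain sum agree, and counts how few kernel bins carry significant energy. A plain O(N²) DFT stands in for the FFT so the example needs only the standard library, and the sizes, Hann window, and 1% threshold are assumptions.

```python
import cmath
import math


def dft(x):
    """Plain O(N^2) DFT, fine for a small demo; a real implementation would use an FFT."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * m * t / n) for t in range(n))
            for m in range(n)]


def make_kernel(f_k, n_k, n_fft, sample_rate):
    """Hann-windowed complex exponential of length n_k, zero-padded to n_fft."""
    kernel = [(0.5 - 0.5 * math.cos(2.0 * math.pi * j / n_k))
              * cmath.exp(2j * math.pi * f_k * j / sample_rate) for j in range(n_k)]
    return kernel + [0j] * (n_fft - n_k)


def time_domain_coeff(x, kernel):
    """Direct sum over samples: sum_j x[j] * conj(kernel[j])."""
    return sum(a * b.conjugate() for a, b in zip(x, kernel))


def spectral_coeff(x_hat, k_hat):
    """Parseval form: (1/N) * sum_m X[m] * conj(K[m])."""
    return sum(a * b.conjugate() for a, b in zip(x_hat, k_hat)) / len(x_hat)


def significant_bins(k_hat, rel_threshold=0.01):
    """Count spectral-kernel entries above rel_threshold times the largest magnitude."""
    peak = max(abs(v) for v in k_hat)
    return sum(1 for v in k_hat if abs(v) > rel_threshold * peak)


if __name__ == "__main__":
    n_fft, sample_rate = 128, 1280.0           # assumed toy sizes
    x = [math.sin(2.0 * math.pi * 200.0 * t / sample_rate) for t in range(n_fft)]
    kernel = make_kernel(200.0, 64, n_fft, sample_rate)
    k_hat = dft(kernel)                        # precomputed once per bin
    x_hat = dft(x)                             # computed once per frame
    print(abs(time_domain_coeff(x, kernel)), abs(spectral_coeff(x_hat, k_hat)))
    print(significant_bins(k_hat), "of", n_fft, "kernel bins above 1% of the peak")
```

Only the bins counted by `significant_bins` would need to be stored and multiplied in a sparse implementation, which is where the saving comes from.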

CQT(k, n) = Σ_{j=0}^{N_k−1} x(n + j) · w_k(j) · e^{−i2π f_k j / f_s}

Where w_k(j) is the window function of length N_k. The key insight is that each bin sees a different number of signal samples, unlike the FFT where every bin uses the same window.

Chapter 6

06 From CQT to Chromagram

Fold the CQT's log-frequency bins into a single octave and you get a chromagram — a 12-dimensional representation of harmonic content, perfect for chord recognition and key detection.

If the CQT has B=36 bins per octave across 5 octaves (180 bins total), each semitone is covered by 3 adjacent bins. The chromagram sums the 3 bins belonging to each pitch class and then sums the result across all 5 octaves, collapsing the 180 bins into 12.
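The folding step can be sketched in a few lines. This example assumes that bin 0 is the first of the three bins belonging to pitch class 0; implementations that centre each group of bins on the note would need an offset. The function name is hypothetical.

```python
def fold_to_chroma(cqt_magnitudes, bins_per_octave=36, num_chroma=12):
    """Fold one CQT frame (lowest bin first) into `num_chroma` pitch classes."""
    bins_per_chroma = bins_per_octave // num_chroma   # 3 when B = 36
    chroma = [0.0] * num_chroma
    for k, magnitude in enumerate(cqt_magnitudes):
        within_octave = k % bins_per_octave           # position inside its octave
        chroma[within_octave // bins_per_chroma] += magnitude
    return chroma


if __name__ == "__main__":
    frame = [0.0] * 180                 # 36 bins per octave, 5 octaves
    for octave in range(3):
        frame[octave * 36] = 1.0        # pitch class 0 in three different octaves
    frame[21] = 2.0                     # a bin inside pitch class 7, lowest octave
    print(fold_to_chroma(frame))
```

Energy placed on the same pitch class in different octaves lands in the same chroma entry, which is the octave invariance the chromagram relies on.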

CQT spectrogram (top) folded into a 12-bin chromagram (bottom). Switch chords to see how the chroma profile changes.

Octave invariance: The chromagram treats C3, C4, and C5 as the same pitch class "C". This makes it robust to register changes and lets algorithms focus on harmonic identity rather than absolute pitch.

Chapter 7

07 Real-World Applications

The CQT is widely used in music information retrieval, audio synthesis, and intelligent audio processing.

Live CQT spectrogram of different musical signals — note how pitched content appears as distinct horizontal bands

Pitch tracking: automatic transcription, melody extraction, tuner apps

Chord recognition: real-time chord detection via CQT → chromagram pipeline

Audio synthesis: phase vocoders with perceptually uniform frequency resolution

Deep learning: CQT spectrograms as input features for music classification, source separation, and generative models

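CQT in neural networks: Some music models use CQT or CQT-like representations as input features instead of mel spectrograms, because the constant-Q resolution matches the pitch structure of the data they're modelling. The harmonic CQT used for pitch salience estimation and in Spotify's Basic Pitch transcription model is one example. This is not universal: CREPE and Demucs operate on the raw waveform (Hybrid Demucs adds an STFT spectrogram branch), and OpenL3 takes mel or linear-frequency spectrograms.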