Chapter 21 — 🔊 Speech & Audio Processing

Speech and audio processing is the branch of applied AI that turns sound — the messy, continuous pressure waves hitting a microphone — into something a machine can recognize, generate, or reason about. It sits at the intersection of classical signal processing and deep learning, and it powers voice assistants, dictation, podcast transcripts, music apps, and accessibility tools. This chapter follows the full arc: how audio is represented as a signal and a sequence, how machines learn to transcribe it (ASR), how they learn to speak (TTS), how they tell who is speaking, how they classify and search non-speech audio, how they learn audio representations without labels, how they compress sound into discrete tokens, and the stubborn real-world challenges of noise, accents, and latency.

🧭 In context: Applied AI built on sequence models · used for transcription, voice synthesis, speaker ID, and audio search · the one key idea: audio becomes a 2-D time-frequency image (the spectrogram), and from there it is a sequence-modeling problem.

💡 Remember this: turn sound into a log-mel-spectrogram and almost every audio task — recognition, synthesis, speaker ID, classification — becomes a sequence-modeling problem you already know how to solve.

21.1 — Audio as a signal and a sequence

Intuition first. Imagine sound as ripples on a pond. A microphone is a tiny cork bobbing up and down on those ripples, and thousands of times a second it writes down how high it sits. That column of numbers is the audio. The rest of this chapter is about turning that raw bobbing record into something a machine can read at a glance — and the trick is to stop staring at the height-over-time wiggle and instead ask “which musical notes are mixed into this moment?”

A microphone measures air pressure many thousands of times per second. Each measurement is a sample, and the number of samples per second is the sampling rate (16,000 Hz for speech, 44,100 Hz for music). String the samples together and you get a waveform: a 1-D array of numbers, the rawest possible representation of sound.

The waveform is faithful but unwieldy. One second of 16 kHz audio is 16,000 numbers, and almost none of them mean anything on their own; what matters is patterns over time. The Nyquist–Shannon sampling theorem tells us a sampling rate of $f_s$ can faithfully capture frequencies up to $f_s/2$. That is why 16 kHz speech keeps everything up to 8 kHz, which covers most of the energy in the human voice, and why music at 44.1 kHz reaches past 20 kHz, the edge of human hearing.

\[f_{\max} = \frac{f_s}{2}\]

In words: the highest frequency you can faithfully record is half your sampling rate; anything faster than that gets scrambled (aliased) into a fake lower tone. Also written: $f_s \ge 2\,f_{\max}$ — to capture a tone of frequency $f_{\max}$ you must sample at least twice as fast as it oscillates.
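Aliasing is easy to see numerically. The sketch below is an illustration only (the 9 kHz tone and the variable names are arbitrary choices, not from the chapter): it samples a tone above the Nyquist limit at 16 kHz and checks where the energy lands in the spectrum.

```python
import numpy as np

fs = 16_000                      # sampling rate in Hz, so the Nyquist limit is 8 kHz
n = np.arange(fs)                # one second of sample indices
tone_hz = 9_000                  # deliberately above the Nyquist limit

x = np.sin(2 * np.pi * tone_hz * n / fs)

# Find the strongest frequency bin in the sampled signal.
spectrum = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(len(x), d=1 / fs)
apparent_hz = float(freqs[np.argmax(spectrum)])

print(apparent_hz)               # the 9 kHz tone appears at fs - 9000 = 7000 Hz
```

The tone folds back around $f_s/2$ and becomes indistinguishable from a genuine 7 kHz tone, which is why recording hardware low-pass filters the signal before sampling.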

The key move in audio processing is switching from the time domain (amplitude vs. time) to the frequency domain (how much energy sits at each pitch). The tool is the Fourier transform, which decomposes a signal into a sum of sine waves (a linear-algebra change of basis). Applied to short overlapping windows of the waveform — the Short-Time Fourier Transform (STFT) — it produces a spectrogram: a 2-D image where the x-axis is time, the y-axis is frequency, and brightness is energy. We use short windows (typically 25 ms, hopping 10 ms) because speech is non-stationary — the frequencies change as the mouth moves — so we cut the signal into slices short enough to be roughly steady within each one.

The single most important move in the whole chapter is this: a window slides along the waveform, and each slice it covers becomes one vertical stripe of the spectrogram.

The discrete Fourier transform that powers each STFT window is, for $N$ samples:

\[X_k = \sum_{n=0}^{N-1} x_n \, e^{-\,i\,2\pi k n / N}, \qquad k = 0,\dots,N-1\]

In words: to find how much of frequency-bin $k$ lives in the window, slide a sine/cosine wave of that frequency across the samples, multiply point-by-point, and add it all up; a big result means that pitch is strongly present. Also written: $X_k = \sum_n x_n\big(\cos(2\pi k n/N) - i\sin(2\pi k n/N)\big)$ — the same sum split into its real (cosine) and imaginary (sine) parts, so $|X_k|^2$ is the energy at bin $k$.
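To make the sum concrete, here is a minimal sketch that evaluates the DFT formula directly and compares it with NumPy's FFT. The function name `naive_dft` and the test signal are illustrative assumptions; real code should call `np.fft.fft`, which computes the same result in $O(N \log N)$.

```python
import numpy as np

def naive_dft(x):
    """Direct O(N^2) evaluation of X_k = sum_n x_n * exp(-i 2 pi k n / N)."""
    x = np.asarray(x, dtype=complex)
    N = len(x)
    n = np.arange(N)
    k = n.reshape(-1, 1)                       # one row per frequency bin
    basis = np.exp(-2j * np.pi * k * n / N)    # (N, N) matrix of complex sinusoids
    return basis @ x

# A cosine that completes exactly 3 cycles in 32 samples should peak at bin 3.
N = 32
x = np.cos(2 * np.pi * 3 * np.arange(N) / N)
X = naive_dft(x)
peak_bin = int(np.argmax(np.abs(X[: N // 2])))
print(peak_bin, np.allclose(X, np.fft.fft(x)))
```

Writing the transform as a matrix product also makes the "change of basis" reading literal: each row of `basis` is one sinusoid, and the product projects the window onto all of them at once.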

Two refinements make the spectrogram match human hearing. First, we don’t perceive pitch linearly: the gap from 100→200 Hz sounds huge, but 5000→5100 Hz is barely noticeable. The mel scale warps the frequency axis to match this perception, giving a mel-spectrogram. The standard mapping is:

\[m = 2595 \,\log_{10}\!\left(1 + \frac{f}{700}\right)\]

In words: convert a frequency in hertz into “perceived pitch” units that are stretched out in the low range (where our ears are picky) and squeezed in the high range (where they are not). Also written: $m = 1127\,\ln\!\left(1 + \frac{f}{700}\right)$ — the identical curve using the natural log instead of base-10 (since $2595/\ln(10)\approx 1127$).

A quick worked feel for that warp: plug in $f = 200$ Hz and you get $m \approx 283$ mels; plug in $f = 5000$ Hz and you get $m \approx 2363$ mels. So a 25× jump in raw hertz is only about an 8× jump in mels — exactly the squeezing of the high end that makes the mel axis line up with what your ear actually notices.
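The same arithmetic as a small sketch, using the mapping above. The helper names `hz_to_mel` and `mel_to_hz` are illustrative, and note that audio libraries sometimes ship slightly different mel variants, so their numbers may not match exactly.

```python
import math

def hz_to_mel(f_hz):
    """Mel mapping from the formula above: m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse mapping, obtained by solving the formula for f."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

low, high = hz_to_mel(200.0), hz_to_mel(5000.0)
print(round(low), round(high), round(high / low, 1))
```

The inverse is what a filterbank builder needs: pick band centers evenly spaced in mels, then map them back to hertz to place the triangular filters.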

Second, we hear loudness logarithmically, so energies are converted to decibels (a log). Finally, MFCCs (Mel-Frequency Cepstral Coefficients) apply one more transform (a discrete cosine transform) to the log-mel energies, compressing each time frame into ~13 decorrelated numbers — the workhorse feature of classical speech systems for decades.
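The MFCC step can be sketched in a few lines. This is a minimal illustration with an unnormalized DCT-II and a made-up log-mel frame; the function name `dct_ii` is an assumption, and library implementations typically add normalization and optional liftering.

```python
import numpy as np

def dct_ii(x, n_coeffs):
    """Unnormalized DCT-II: c_k = sum_n x_n * cos(pi * k * (n + 0.5) / N)."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    n = np.arange(N)
    k = np.arange(n_coeffs).reshape(-1, 1)     # one row per cepstral coefficient
    return (np.cos(np.pi * k * (n + 0.5) / N) * x).sum(axis=1)

# Hypothetical log-mel frame: 40 bands with a smooth spectral tilt.
log_mel_frame = np.linspace(2.0, -1.0, 40)
mfcc = dct_ii(log_mel_frame, n_coeffs=13)      # keep only the first 13 coefficients
print(mfcc.shape)
```

Keeping only the low-order coefficients retains the smooth spectral envelope and discards fine detail, which is the information loss the chapter's later rule of thumb refers to.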

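The pipeline: waveform → STFT over short windows → power spectrogram → mel filterbank → log (decibels) → optional DCT for MFCCs.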

Here is the core transform from scratch — a single mel-spectrogram frame, no libraries beyond numpy:

import numpy as np
# tiny worked example: one 25 ms window of 16 kHz audio = 400 samples
fs, win = 16000, 400
t = np.arange(win) / fs
# a vowel-like signal: 200 Hz pitch + a 700 Hz formant
sig = np.sin(2*np.pi*200*t) + 0.5*np.sin(2*np.pi*700*t)
sig *= np.hanning(win)                 # taper edges to reduce spectral leakage
spectrum = np.abs(np.fft.rfft(sig))**2 # power at each frequency bin
freqs = np.fft.rfftfreq(win, 1/fs)
# crude mel filterbank: 3 triangular bands
def tri(f, lo, hi):
    c = (lo+hi)/2
    return np.clip(1 - np.abs(f-c)/(c-lo), 0, 1)
mel = [ (spectrum * tri(freqs, *b)).sum() for b in [(50,300),(300,900),(900,2000)] ]
print("log-mel energies:", np.round(np.log(np.array(mel)+1e-6), 2))
# the 200 Hz + 700 Hz energy lands in the first two bands, as expected

In practice nobody hand-rolls this — a one-liner from a real library gives you the production feature. With torchaudio:

import torchaudio, torch
wav, sr = torchaudio.load("speech.wav")          # (channels, samples)
melspec = torchaudio.transforms.MelSpectrogram(
    sample_rate=sr, n_fft=400, hop_length=160, n_mels=80
)(wav)                                            # (channels, 80 mels, frames)
logmel = torchaudio.transforms.AmplitudeToDB()(melspec)  # to decibels (log)
print(logmel.shape)   # the 80 x T "image" that most audio models consume

Tip

Rule of thumb: most modern audio models eat a log-mel-spectrogram, not MFCCs; the main exceptions learn directly from the raw waveform, as the self-supervised encoders in Section 21.6 do. MFCCs were designed to decorrelate features for Gaussian models; neural nets don’t need that, and the extra DCT just throws away information they could use.

21.2 — Automatic Speech Recognition (ASR)

Automatic Speech Recognition (ASR) maps audio to text. The hard part is that speech has no spaces: the sound stream is continuous, speakers talk at different speeds, and the same word stretches over a variable number of audio frames. The history of ASR is the story of how the field handled this alignment problem.

The classical pipeline (pre-2015). Old systems factored the problem with Bayes’ rule. To find the most likely word sequence $W$ given acoustics $X$:

\[\hat{W} = \arg\max_W \; \underbrace{P(X \mid W)}_{\text{acoustic model}} \; \underbrace{P(W)}_{\text{language model}}\]

In words: out of every possible sentence, pick the one that both sounds like the recorded audio (acoustic model) and reads like plausible language (language model); multiply those two scores and take the winner. Also written: $\hat{W} = \arg\max_W P(W \mid X)$, since by Bayes’ rule $P(W\mid X) \propto P(X\mid W)\,P(W)$ — the $P(X)$ denominator is the same for every $W$, so it drops out of the $\arg\max$.

The acoustic model (a Hidden Markov Model with Gaussian mixtures, later a neural net) scored how well audio matched phonemes — the atomic sounds of a language. A pronunciation lexicon mapped phonemes to words. The language model (an n-gram) scored how likely a word sequence was as English. A decoder searched the combined space. It worked, but it needed three separately trained components, a hand-built lexicon, and forced alignments between audio frames and phonemes.

End-to-end CTC. Connectionist Temporal Classification (CTC) collapsed all of that into one neural network trained directly on (audio, text) pairs, with no per-frame alignment needed. The trick: the network emits a character (or a special blank token) at every audio frame, then a collapsing rule first merges adjacent repeats and then removes blanks (the order matters, as the table below shows). The CTC loss sums the probability over all frame-level paths that collapse to the target text.

\[P(Y \mid X) = \sum_{\pi \,\in\, \mathcal{B}^{-1}(Y)} \prod_{t=1}^{T} p_t(\pi_t \mid X)\]

In words: the probability of the target text $Y$ is the total probability of every frame-by-frame path $\pi$ that collapses down to $Y$, where each path’s probability is just the product of the per-frame symbol probabilities. Also written: $\mathcal{L}_{\text{CTC}} = -\log P(Y \mid X)$ — training minimizes the negative log of that summed-over-alignments probability ($\mathcal{B}$ is the collapse function; $\mathcal{B}^{-1}(Y)$ is all paths that map to $Y$).

A worked example makes the collapsing rule concrete. Target word: CAT. Suppose the network sees 6 frames and emits one of these alignments (_ is blank):

Frame-level output	Collapse repeats	Remove blanks	Result
`C C _ A T T`	`C _ A T`	`C A T`	✓ CAT
`_ C A A _ T`	`_ C A _ T`	`C A T`	✓ CAT
`C A A T T T`	`C A T`	`C A T`	✓ CAT

The blank token is essential: without it, the double-l in HELLO would collapse to a single L. By inserting a blank between the two L’s, the model can keep them apart. CTC sums probability across every valid alignment, so the model never has to be told which frame is which letter — it learns alignment for free. Its one weakness: CTC assumes outputs are conditionally independent given the audio, so it leans on an external language model for fluent text.
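The collapsing rule and the sum over alignments can be sketched as follows. This is a brute-force illustration on toy probabilities, with hypothetical helper names; it enumerates every path, which is exponential in the number of frames, whereas real CTC implementations use a dynamic-programming forward algorithm.

```python
from itertools import groupby, product

BLANK = "_"

def ctc_collapse(path):
    """Merge adjacent repeats first, then drop blanks."""
    merged = [symbol for symbol, _ in groupby(path)]
    return "".join(s for s in merged if s != BLANK)

def ctc_prob_brute_force(frame_probs, target):
    """Sum the probability of every frame-level path that collapses to target.

    frame_probs: one dict per frame, mapping symbol -> probability.
    """
    symbols = list(frame_probs[0])
    total = 0.0
    for path in product(symbols, repeat=len(frame_probs)):
        if ctc_collapse(path) == target:
            p = 1.0
            for t, s in enumerate(path):
                p *= frame_probs[t][s]      # per-frame independence assumption
            total += p
    return total

print(ctc_collapse("CC_ATT"), ctc_collapse("HEL_LO"), ctc_collapse("HELLO"))

# Two frames, uniform over {blank, A}: "A_", "_A" and "AA" all collapse to "A".
uniform = [{BLANK: 0.5, "A": 0.5}] * 2
print(ctc_prob_brute_force(uniform, "A"))   # 3 paths * 0.25 each
```

Note how `HELLO` without a blank loses its double L, while `HEL_LO` keeps it.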

Attention / sequence-to-sequence. A second end-to-end family uses an encoder–decoder with attention (see Attention & Transformers): the encoder turns audio into a sequence of vectors, and the decoder generates text one token at a time, attending to the relevant audio frames at each step. This learns a soft alignment jointly with the language, often beating CTC on fluency — at the cost of needing the whole utterance before decoding, which is bad for streaming.

Whisper-style models. Modern systems like Whisper are large Transformer encoder–decoders trained on hundreds of thousands of hours of weakly-labeled, multilingual web audio. The scale buys robustness: they transcribe accents, background music, and many languages out of the box, and a single model also does translation and language identification. The lesson mirrors NLP — at sufficient scale, one general model beats a stack of specialized components.

In practice, running a state-of-the-art ASR model is now a few lines via Hugging Face:

from transformers import pipeline
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
text = asr("meeting.wav")["text"]          # multilingual, robust to noise
print(text)
# return_timestamps=True adds segment times ("word" for word-level times);
# to force a language, pass generate_kwargs={"language": "english"}

flowchart TD
  A[Audio waveform] --> B[Log-mel spectrogram]
  B --> C{Modeling approach}
  C -->|"Classical"| D["Acoustic HMM-GMM<br/>+ lexicon + n-gram LM<br/>+ decoder"]
  C -->|"End-to-end CTC"| E["Encoder to per-frame chars<br/>collapse blanks/repeats"]
  C -->|"Seq2seq / Whisper"| F["Encoder-decoder + attention<br/>autoregressive text"]
  D --> G[Transcript]
  E --> G
  F --> G

Warning

Common mistake: judging ASR by accuracy on clean read speech. The metric that matters is Word Error Rate (WER) — $(\text{substitutions}+\text{insertions}+\text{deletions})/\text{words}$ — measured on your real audio: spontaneous, noisy, accented, domain-specific. A model at 4% WER on audiobooks can hit 25% on a noisy call-center recording.

The WER formula, spelled out, since it is the number reported in every ASR paper:

\[\text{WER} = \frac{S + I + D}{N}\]

In words: count how many words the system swapped (substitutions), added (insertions), and dropped (deletions) compared to the human reference, then divide by the number of words actually spoken. Also written: $\text{WER} = \dfrac{N - C + I}{N}$, where $C = N - S - D$ is the number of reference words recognized correctly. This is the word-level edit distance between hypothesis and reference, normalized by reference length; because insertions count as errors, WER can exceed 100%.

Worked example: the reference is “the cat sat on the mat” (6 words). The system outputs “the cat sat the mat” — it dropped “on” (1 deletion) and is otherwise correct. So $S=0,\,I=0,\,D=1,\,N=6$, giving $\text{WER} = 1/6 \approx 0.167$, or 16.7%.
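A minimal sketch of how WER is computed in practice, as a word-level edit distance. The function name `wer` is illustrative, and text normalization (casing, punctuation, number formatting), which matters a great deal in real evaluations, is deliberately left out.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j] = minimum edits to turn the first i ref words into the first j hyp words
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i                      # delete all i reference words
    for j in range(len(hyp) + 1):
        dist[0][j] = j                      # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            dist[i][j] = min(substitution, deletion, insertion)
    return dist[-1][-1] / len(ref)

score = wer("the cat sat on the mat", "the cat sat the mat")
print(round(score, 3))                      # one deletion out of six words
```

The dynamic program finds the cheapest mix of substitutions, insertions, and deletions, so the individual counts never have to be labeled by hand.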

21.3 — Text-to-Speech (TTS)

Text-to-Speech (TTS), or speech synthesis, is the inverse problem: turn text into a natural-sounding waveform. It splits into a front end (text → linguistic features: how to pronounce numbers, abbreviations, where to put stress and pauses) and a back end (linguistic features → audio). The front end alone is harder than it looks — “Dr.” is Doctor on one line and Drive on the next, and “$1.50” must become “one dollar and fifty cents.” The back end is where the technology evolved most.

Concatenative TTS (1990s–2000s). Record one speaker saying many sentences, chop the audio into tiny units (phones, diphones), and at synthesis time stitch the best-matching units together. The output sounds like the real human — because it is — but only for sounds in the database; joins can be audible, and you cannot change the voice’s emotion or speed without re-recording. It is essentially a giant copy-paste of real speech.

Parametric TTS (2000s–2010s). Instead of storing audio, train a statistical model (an HMM, later a neural net) to predict acoustic parameters — pitch, spectral envelope, duration — which a vocoder turns into sound. It has a tiny footprint and is fully controllable (change pitch, speed, voice), but the vocoder makes it sound buzzy and muffled — the classic “robot” voice.

Neural TTS (2016→). Deep learning collapsed the quality gap. The modern recipe is two stages:

An acoustic model like Tacotron 2 — a sequence-to-sequence attention network — maps text (or phonemes) to a mel-spectrogram, learning prosody and rhythm end-to-end.
A neural vocoder like WaveNet converts that mel-spectrogram into a raw waveform. WaveNet was the breakthrough: an autoregressive model that predicts each audio sample from the previous ones, producing audio nearly indistinguishable from human speech.

WaveNet’s autoregressive idea, in one formula, is worth pinning down because it reappears everywhere from PixelCNN to language models:

\[p(\mathbf{x}) = \prod_{t=1}^{T} p\!\left(x_t \mid x_1, x_2, \dots, x_{t-1}\right)\]

In words: the probability of a whole waveform is built one sample at a time — each new audio sample is predicted from all the samples that came before it. Also written: $p(\mathbf{x}) = \prod_t p(x_t \mid x_{<t})$, where $x_{<t}$ is shorthand for the entire history before step $t$ — which is exactly why naive generation is sequential and slow.
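The chain rule can be checked on a toy model. The sketch below is not WaveNet: it uses a made-up rule over binary "samples" (the function `next_prob` is hypothetical) purely to show that multiplying per-step conditionals gives a valid probability distribution over whole sequences.

```python
import math
from itertools import product

def next_prob(history):
    """Hypothetical toy model: P(x_t = 1 | history), biased to repeat the last sample."""
    if not history:
        return 0.5
    return 0.8 if history[-1] == 1 else 0.2

def log_prob(sequence):
    """Chain rule: log p(x) = sum over t of log p(x_t | x_<t)."""
    total = 0.0
    for t, x_t in enumerate(sequence):
        p_one = next_prob(sequence[:t])     # condition only on earlier samples
        total += math.log(p_one if x_t == 1 else 1.0 - p_one)
    return total

p_1110 = math.exp(log_prob((1, 1, 1, 0)))   # 0.5 * 0.8 * 0.8 * 0.2
total_mass = sum(math.exp(log_prob(s)) for s in product((0, 1), repeat=4))
print(p_1110, total_mass)
```

Generation runs the same loop in the other direction: sample $x_t$ from the conditional, append it to the history, and repeat, which is why each step has to wait for the previous one.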

The “one sample at a time” idea, made tangible: each new sample is generated only after looking back at the samples already laid down before it.

flowchart LR
  T["Text: Dr. Smith"] --> FE["Front end:<br/>normalize + phonemize"]
  FE --> AM["Acoustic model<br/>Tacotron 2<br/>to mel-spectrogram"]
  AM --> V["Neural vocoder<br/>WaveNet / HiFi-GAN<br/>to waveform"]
  V --> W["Audio out"]

The catch with WaveNet was speed: predicting 24,000 samples per second one at a time is painfully slow. Later vocoders — parallel WaveNet, HiFi-GAN (a GAN-based vocoder) — generate all samples at once, hitting real-time or faster while keeping quality high. Today, end-to-end systems such as VITS fold both stages into a single network.

Era	How it makes sound	Quality	Flexibility	Footprint
Concatenative	Stitch recorded clips	High (in-domain)	Very low	Huge (audio DB)
Parametric	Vocoder from predicted params	Low (buzzy)	High	Tiny
Neural	Mel-net + neural vocoder	Human-level	High	Large (GPU)

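A complete neural-TTS call is now also a few lines. Here is a Hugging Face pipeline running Bark, a Transformer-based model that generates audio-codec tokens rather than following the mel-plus-HiFi-GAN recipe above: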

from transformers import pipeline
import scipy.io.wavfile
tts = pipeline("text-to-speech", model="suno/bark-small")  # text -> waveform
out = tts("Dr. Smith will see you at 1:50.")
scipy.io.wavfile.write("hello.wav", rate=out["sampling_rate"],
                       data=out["audio"].squeeze())

Tip

Intuition: think of neural TTS as ASR run backwards. ASR is audio → spectrogram → text; TTS is text → spectrogram → audio. The mel-spectrogram is the shared “interlingua” in the middle of both pipelines.

21.4 — Speaker identification & diarization

So far we asked what was said. Now we ask who said it. The key idea is the speaker embedding: a fixed-length vector (a d-vector or x-vector) that captures the unique timbre of a voice, learned so that two clips from the same person land close together and clips from different people land far apart — the same metric-learning idea as face recognition.

Three related tasks build on this:

Speaker verification (1-to-1): “Is this the enrolled user?” Compare the new clip’s embedding to the enrolled one via cosine similarity, and accept if it is above a threshold. This is voice unlock.
Speaker identification (1-to-N): “Which of my N known speakers is this?” Pick the nearest enrolled embedding.
Speaker diarization: “Who spoke when?” Given a meeting recording with unknown speakers, segment it and label each segment — Speaker A, Speaker B… The classic approach slices audio into short windows, embeds each, then clusters the embeddings; each cluster becomes one speaker.

The comparison engine for all three is cosine similarity:

\[\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a}\rVert \, \lVert \mathbf{b}\rVert}\]

In words: measure the angle between two voice vectors — point the same direction (score near 1) and it’s likely the same speaker; point in unrelated directions (score near 0) and it’s someone else, regardless of how loud either recording was. Also written: $\cos(\mathbf{a},\mathbf{b}) = \hat{\mathbf{a}} \cdot \hat{\mathbf{b}}$, the plain dot product of the two vectors after each is scaled to unit length — which is why only direction, not magnitude, matters.

Worked example of verification with embeddings:

import numpy as np
def cos(a, b): return a@b / (np.linalg.norm(a)*np.linalg.norm(b))
enrolled = np.array([0.9, 0.1, 0.4])   # x-vector of the real user
same     = np.array([0.8, 0.2, 0.5])   # same user, new recording
impostor = np.array([0.1, 0.9, 0.2])   # someone else
print(round(cos(enrolled, same), 2))      # 0.98 -> above 0.7 threshold -> ACCEPT
print(round(cos(enrolled, impostor), 2))  # 0.28 -> below threshold -> REJECT

A diarization pipeline ties it together:

flowchart LR
  A[Meeting audio] --> B["VAD:<br/>drop silence"]
  B --> C["Slice into<br/>1-2 s windows"]
  C --> D["Embed each window<br/>(x-vector)"]
  D --> E["Cluster embeddings"]
  E --> F["Label who spoke when<br/>Spk A: 0-4s, Spk B: 4-9s"]

The first stage, Voice Activity Detection (VAD), is worth naming on its own: a lightweight classifier that decides, frame by frame, whether speech is present at all. It trims silence before the expensive embedding step, and it is also what tells a phone “the caller stopped talking.”
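To give a feel for the frame-by-frame decision, here is a deliberately simple energy-threshold sketch. It is a stand-in, not the learned classifier described above: the function name, the 25 ms / 10 ms framing, and the -40 dB threshold are all assumptions chosen for illustration, and a pure energy rule will fail in noisy rooms where production VADs still work.

```python
import numpy as np

def energy_vad(signal, frame_len=400, hop=160, threshold_db=-40.0):
    """Flag frames whose mean power, in dB relative to amplitude 1.0, exceeds a threshold."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    flags = np.zeros(n_frames, dtype=bool)
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_len]
        power_db = 10.0 * np.log10(np.mean(frame ** 2) + 1e-12)  # epsilon avoids log(0)
        flags[i] = power_db > threshold_db
    return flags

fs = 16_000
silence = np.zeros(fs // 2)                        # 0.5 s of digital silence
t = np.arange(fs // 2) / fs
tone = 0.5 * np.sin(2 * np.pi * 220 * t)           # stand-in for 0.5 s of speech
flags = energy_vad(np.concatenate([silence, tone]))
print(flags[:3], flags[-3:])
```

Downstream stages would then embed only the frames flagged `True`.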

The clustering step is easy to picture: every short window becomes a dot in embedding space, and dots from the same voice naturally pile up together. With two speakers you get two such piles, one cluster per speaker, which is exactly what “who spoke when” reduces to once the embeddings exist.
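One simple way to form those piles is sketched below: a greedy pass that assigns each window to the most similar running centroid, or opens a new speaker when nothing is similar enough. This is an illustrative choice, not the method of any particular toolkit (agglomerative or spectral clustering are common in practice), and the 3-D embeddings and 0.7 threshold are made up.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def greedy_cluster(embeddings, threshold=0.7):
    """Assign each window to its most similar centroid, or start a new speaker."""
    centroids, counts, labels = [], [], []
    for e in embeddings:
        sims = [cosine(e, c) for c in centroids]
        if sims and max(sims) >= threshold:
            k = int(np.argmax(sims))
            # Update the running mean of the matched speaker's centroid.
            centroids[k] = (centroids[k] * counts[k] + e) / (counts[k] + 1)
            counts[k] += 1
        else:
            k = len(centroids)              # no match: open a new speaker
            centroids.append(e.astype(float))
            counts.append(1)
        labels.append(k)
    return labels

# Made-up window embeddings: two speakers taking turns.
windows = np.array([
    [0.9, 0.1, 0.4], [0.8, 0.2, 0.5],       # first speaker
    [0.1, 0.9, 0.2], [0.2, 0.8, 0.1],       # second speaker
    [0.85, 0.15, 0.45],                     # first speaker again
])
print(greedy_cluster(windows))              # one label per window
```

Because the number of clusters is discovered rather than given, this style of approach does not need the speaker count in advance, though the threshold then becomes the parameter to tune.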

Here is real speaker-embedding verification using SpeechBrain, a widely used open-source toolkit for speaker tasks. Note that this pretrained model is an ECAPA-TDNN, a successor to the x-vector architecture:

from speechbrain.inference.speaker import SpeakerRecognition  # SpeechBrain >= 1.0 module path
verifier = SpeakerRecognition.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb")
score, accept = verifier.verify_files("enroll.wav", "test.wav")
print(float(score), bool(accept))   # cosine score + threshold decision

Warning

Common mistake: assuming the number of speakers is known. In real meetings it isn’t, and overlapping speech (two people talking at once) breaks the one-window-one-speaker assumption that simple clustering relies on — a leading source of diarization error.

21.5 — Audio classification, music information retrieval, and keyword spotting

Not all audio is speech. Audio classification assigns a label to a sound clip — dog bark, glass breaking, siren, baby crying. The pattern is identical to image classification: turn the clip into a mel-spectrogram “image” and run a CNN or an audio Transformer over it. This treats sound recognition as a vision problem on a time-frequency picture, which is why so much computer-vision machinery transfers directly.

Music Information Retrieval (MIR) is the audio-classification family aimed at music: genre and mood tagging, beat and tempo (BPM) tracking, chord recognition, instrument detection, and audio fingerprinting — the Shazam trick of matching a noisy phone clip to a track by hashing peaks in its spectrogram. The fingerprint is robust because peak locations survive noise and compression that would wreck the raw waveform. MIR powers recommendation, auto-tagging, and DJ tools.

Keyword spotting (KWS), or wake-word detection, is the always-listening task: a tiny model that runs continuously on-device, waiting for “Hey Siri” or “Alexa.” Its constraints are unique — it must be small enough to fit on a microcontroller, sip power, and produce almost no false accepts (waking when you didn’t call it) while keeping false rejects (missing a real call) low. Only after the wake word fires does the heavy cloud ASR kick in.

The two-stage idea is the whole game here: a cheap detector listens forever and stays quiet, then lights up the instant it hears the wake word and hands off to the expensive model.

A worked classification pass with a real pretrained audio model (the Audio Spectrogram Transformer):

from transformers import pipeline
clf = pipeline("audio-classification",
               model="MIT/ast-finetuned-audioset-10-10-0.4593")
print(clf("street.wav", top_k=3))
# e.g. [{'label': 'Siren', 'score': 0.61}, {'label': 'Vehicle', ...}, ...]

flowchart LR
  A[Mic stream] --> B["Tiny on-device<br/>KWS model<br/>always on, low power"]
  B -->|"wake word detected"| C["Wake: stream to<br/>full ASR + NLU"]
  B -->|"no wake word"| B

Tip

Why a two-stage design: running full ASR continuously would drain the battery and stream your whole day to the cloud. KWS is the cheap, private gatekeeper; the expensive model only runs a small fraction of the time. The same tiered pattern appears all over efficient ML: a cheap filter in front of an expensive model.

21.6 — Self-supervised audio representation learning

Intuition first. Labeled speech is scarce and expensive — transcribing one hour of audio can take five hours of human effort — but unlabeled audio is nearly infinite (every podcast, audiobook, and YouTube video). Self-supervised learning is the trick of letting a model teach itself from that raw ocean of sound by playing a fill-in-the-blank game: hide part of the audio and make the model predict what was hidden. The model that gets good at this game has, as a side effect, learned what speech is — and then a tiny amount of labeled data is enough to point that knowledge at transcription, speaker ID, or emotion.

This is the audio version of what BERT did for text, and it is why a model like wav2vec 2.0 can reach strong WER with only ten minutes of labeled speech after pretraining on tens of thousands of unlabeled hours.

The two landmark recipes:

wav2vec 2.0 (Meta, 2020). A CNN encoder turns the raw waveform into latent frames; some frames are masked; a Transformer then must identify the correct latent for each masked spot from a small set of distractors (a contrastive task). It also learns a discrete “vocabulary” of sound units via product quantization as the prediction targets.
HuBERT (Meta, 2021). Simpler target: first cluster audio features with k-means to assign every frame a pseudo-label (a fake “phoneme ID”), then train the masked Transformer to classify the hidden frames into those cluster IDs — a BERT-style masked-prediction loss instead of a contrastive one.

The masked-prediction objective, in one line:

\[\mathcal{L} = -\sum_{t \in M} \log p\!\left(z_t \mid \tilde{\mathbf{X}}\right)\]

In words: only over the masked time steps $M$, ask the model to predict the correct hidden unit $z_t$ from the corrupted (partially masked) input $\tilde{\mathbf{X}}$, and penalize it when it’s wrong. Also written: $\mathcal{L} = -\frac{1}{|M|}\sum_{t\in M}\log p(z_t \mid \tilde{\mathbf X})$ — the same loss averaged over the masked frames; $z_t$ is a quantized latent (wav2vec 2.0) or a k-means cluster ID (HuBERT).
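The averaged form of the loss can be sketched in NumPy. The shapes and the function name are assumptions for illustration: `logits` holds one row of scores per frame over $K$ candidate units, and only the masked frames contribute.

```python
import numpy as np

def masked_prediction_loss(logits, targets, mask):
    """Mean cross-entropy over masked frames only.

    logits: (T, K) scores per frame; targets: (T,) unit IDs; mask: (T,) bool.
    """
    shifted = logits - logits.max(axis=1, keepdims=True)       # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    picked = log_probs[np.arange(len(targets)), targets]       # log p(z_t | input)
    return float(-picked[mask].mean())

T, K = 6, 4
logits = np.zeros((T, K))                  # a model that predicts uniformly
targets = np.array([0, 1, 2, 3, 0, 1])
mask = np.array([False, True, True, False, False, True])
loss = masked_prediction_loss(logits, targets, mask)
print(loss, np.log(K))                     # a uniform guess costs log(K) per frame
```

Frames outside the mask never enter the sum, so the model gets no credit for copying inputs it can already see.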

Where this shows up. This is not a lab curiosity: the speech pipeline behind many phone keyboards, call-center analytics, and low-resource-language transcription efforts starts from a wav2vec 2.0 or HuBERT checkpoint. A team building ASR for, say, Welsh or Swahili — languages with almost no transcribed corpora — can fine-tune a multilingual pretrained encoder on a few hours of labels and land near where an English-only system needed thousands of hours.

The payoff is the pretrain-then-finetune workflow that now dominates speech:

flowchart LR
  U["50k hours<br/>unlabeled audio"] --> P["Self-supervised<br/>pretrain (mask + predict)"]
  P --> R["Reusable speech<br/>encoder (wav2vec2 / HuBERT)"]
  R --> F1["+ tiny CTC head<br/>10 min labels -> ASR"]
  R --> F2["+ small head -> speaker ID,<br/>emotion, language"]

In code, the whole point is that the heavy lifting is already done — you load a pretrained encoder and fine-tune a small head:

import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
# pretrained and fine-tuned for ASR on 960 h of speech; or load the base model and add your own head
proc  = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
# speech_array: 1-D float array of 16 kHz mono audio, assumed to be loaded already
inputs = proc(speech_array, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():                          # inference only, no gradients needed
    logits = model(**inputs).logits
pred_ids = torch.argmax(logits, dim=-1)        # greedy CTC decode
print(proc.batch_decode(pred_ids))

Tip

Why this changed the field: before self-supervision, every new language or domain needed its own large labeled corpus. Now one pretrained encoder transfers to dozens of downstream tasks with a fraction of the labels — the same “foundation model” shift that hit NLP and vision, arriving in audio.

21.7 — Neural audio codecs & discrete audio tokens

Intuition first. Text is naturally made of discrete tokens (words, sub-words), which is exactly what makes language models possible — a Transformer predicts the next token from a fixed vocabulary. Audio is continuous, so for years it couldn’t be fed to a language model the same way. A neural audio codec fixes this: it learns to chop sound into a stream of discrete audio tokens — a small alphabet of “sound words” — so that audio becomes just another sequence of tokens a Transformer can predict. This is the hidden engine under modern audio LLMs and text-to-audio models.

A neural codec is an autoencoder with a quantizer in the middle: an encoder compresses the waveform into a low-rate latent, a vector quantizer snaps each latent to the nearest entry in a learned codebook (turning it into an integer token), and a decoder reconstructs the waveform. The landmarks are SoundStream (Google) and EnCodec (Meta).

The leftover-error trick, in plain terms. One small codebook can’t describe sound precisely — it only has a handful of “sound words” to choose from, so its best guess is always a little off. Residual Vector Quantization (RVQ) fixes that by stacking guessers. The first codebook makes a rough guess. You subtract that guess from the true vector and you’re left with the error it missed. A second codebook then guesses that error; you subtract again, and a third codebook guesses what’s still left, and so on. Each layer is a small correction on top of the last, so a stack of tiny codebooks ends up as accurate as one impossibly huge codebook — and you store just the chosen index from each layer.

\[\mathbf{z} \approx \sum_{i=1}^{Q} \mathbf{e}_i, \qquad \mathbf{e}_i = \text{codebook}_i\big[\,k_i\,\big]\]

In words: the encoded vector is approximated by adding up $Q$ chosen codebook entries — the first captures the rough shape, each later one corrects the remaining error — and the integer indices $k_1,\dots,k_Q$ are the discrete audio tokens you actually store or feed to a model. Also written: define the residual $\mathbf{r}_1 = \mathbf{z}$ and iterate $k_i = \arg\min_k \lVert \mathbf{r}_i - \text{codebook}_i[k]\rVert$, $\;\mathbf{r}_{i+1} = \mathbf{r}_i - \text{codebook}_i[k_i]$ — successive refinement, the same idea as residual learning.

A tiny worked RVQ pass makes the “correct the leftover” idea concrete. Say the true latent is the single number $\mathbf{z} = 0.74$, and each codebook holds just two entries:

Codebook 1 = {0.0, 0.8}. Nearest to 0.74 is 0.8 (index 1). Residual: $0.74 - 0.8 = -0.06$.
Codebook 2 = {−0.1, +0.1}. Nearest to −0.06 is −0.1 (index 0). Residual: $-0.06 - (-0.1) = +0.04$.
Codebook 3 = {−0.03, +0.03}. Nearest to +0.04 is +0.03 (index 1). Residual now ≈ 0.01.

Reconstruction: $0.8 + (-0.1) + 0.03 = 0.73$, within 0.01 of the true 0.74 — using only three 1-bit choices. The stored tokens are just the indices (1, 0, 1).
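The worked pass above can be reproduced with a short sketch. It handles a scalar latent only, to mirror the example; the function names are illustrative, and a real codec applies the same loop to vectors with learned codebooks.

```python
import numpy as np

def rvq_encode(z, codebooks):
    """Quantize z layer by layer; each codebook encodes the previous layer's error."""
    residual, indices = float(z), []
    for cb in codebooks:
        k = int(np.argmin(np.abs(cb - residual)))   # nearest entry to the residual
        indices.append(k)
        residual -= cb[k]                           # what this layer failed to capture
    return indices, residual

def rvq_decode(indices, codebooks):
    """Reconstruct by summing the chosen entry from every codebook."""
    return float(sum(cb[k] for cb, k in zip(codebooks, indices)))

codebooks = [np.array([0.0, 0.8]), np.array([-0.1, 0.1]), np.array([-0.03, 0.03])]
tokens, leftover = rvq_encode(0.74, codebooks)
recon = rvq_decode(tokens, codebooks)
print(tokens, round(recon, 2), round(leftover, 2))
```

Dropping the last codebooks at decode time still gives a coarser reconstruction, which is how some codecs trade bitrate for quality with a single model.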

Why this matters: once audio is a token stream, generation becomes a language-modeling problem. Models like AudioLM, VALL-E, and MusicGen generate speech or music by predicting these codec tokens autoregressively, then hand them to the codec’s decoder to render sound. The same codec also gives extreme compression (EnCodec hits a few kbit/s at quality that rivals classic codecs like Opus).

A real EnCodec round-trip is short:

from transformers import EncodecModel, AutoProcessor
import torch
model = EncodecModel.from_pretrained("facebook/encodec_24khz")
proc  = AutoProcessor.from_pretrained("facebook/encodec_24khz")
# wav: 1-D float array of 24 kHz mono audio, assumed to be loaded already
inputs = proc(raw_audio=wav, sampling_rate=24000, return_tensors="pt")
with torch.no_grad():                  # inference only, no gradients needed
    enc = model.encode(inputs["input_values"], inputs["padding_mask"])
    codes = enc.audio_codes            # integer tokens, typically (chunks, batch, num_codebooks, T)
    recon = model.decode(enc.audio_codes, enc.audio_scales,
                         inputs["padding_mask"])[0]   # back to a waveform
print(codes.shape)                     # these tokens are what an audio LLM predicts

Tip

The unifying picture: Section 21.6 made audio understandable without labels (self-supervised encoders); this section makes audio generatable as tokens (neural codecs). Together they are why “audio language models” exist — one turns sound into features, the other turns sound into a vocabulary, and a Transformer does the rest.

21.8 — Challenges: noise, accents, and latency

Lab demos are clean; the real world is not. Three challenges dominate deployed speech systems.

Noise and reverberation. Background chatter, traffic, music, and room echo (reverberation) all corrupt the signal. The main defenses are data augmentation — training on audio with synthetic noise and echo mixed in, plus SpecAugment, which masks random time and frequency bands of the spectrogram so the model can’t over-rely on any single cue — and front-end speech enhancement / denoising. The far-field problem (a speaker across a room, as with smart speakers) compounds this, and is attacked with microphone arrays and beamforming, which combine several mics to steer sensitivity toward the talker and suppress sound from other directions.

A common way to quantify how bad the noise is, and to set augmentation levels, is the signal-to-noise ratio:

\[\text{SNR}_{\text{dB}} = 10 \,\log_{10}\!\left(\frac{P_{\text{signal}}}{P_{\text{noise}}}\right)\]

In words: compare the power of the speech to the power of the background, on a decibel (log) scale — a big positive number means clean speech, near zero means the noise is as loud as the talker. Also written: $\text{SNR}_{\text{dB}} = 20\,\log_{10}\!\left(\dfrac{A_{\text{signal}}}{A_{\text{noise}}}\right)$ — the same ratio expressed with amplitudes instead of powers (since power $\propto$ amplitude$^2$, the 10 becomes 20).

Worked example: if the speech carries 100 units of power and the background carries 1, the ratio is 100, and $10\log_{10}(100) = 20$ dB, which is comfortably clean. Let the noise grow to match the speech (ratio 1) and you get $10\log_{10}(1) = 0$ dB: the talker and the background are equally loud, the regime where recognition accuracy falls off a cliff (WER climbs steeply).
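
The formula also tells you how to mix noise into clean speech at a chosen SNR for augmentation. The sketch below is one simple way to do it, assuming power is measured as mean squared amplitude; the helper names (`power`, `snr_db`, `scale_noise_to_snr`) are hypothetical, and the random arrays merely stand in for real speech and noise recordings.

```python
import numpy as np

def power(x):
    """Mean squared amplitude of a signal."""
    return float(np.mean(np.square(x)))

def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels, computed from powers."""
    return 10.0 * np.log10(power(signal) / power(noise))

def scale_noise_to_snr(speech, noise, target_db):
    """Rescale `noise` so that speech-vs-noise lands on `target_db`."""
    gain = np.sqrt(power(speech) / (power(noise) * 10.0 ** (target_db / 10.0)))
    return gain * noise

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)   # stand-in for 1 s of speech at 16 kHz
noise = rng.standard_normal(16000)    # stand-in for background noise

noise_20db = scale_noise_to_snr(speech, noise, 20.0)
noisy = speech + noise_20db           # one augmented training example
print(round(snr_db(speech, noise_20db), 2))
```

The printed value should be very close to 20. Sampling `target_db` from a range during training, rather than fixing it, is a common way to expose the model to both clean and difficult conditions.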

In practice, SpecAugment-style masking takes only a few lines, applied to the log-mel features during training:

import torch
import torchaudio.transforms as T

logmel = torch.randn(1, 80, 400)   # placeholder features shaped (batch, n_mels, time)
freq_mask = T.FrequencyMasking(freq_mask_param=27)   # hide a horizontal band of up to 27 mel bins
time_mask = T.TimeMasking(time_mask_param=100)        # hide a vertical band of up to 100 frames
aug = time_mask(freq_mask(logmel))   # forces the model to use redundant cues

Accents and dialects. Models inherit the bias of their training data. A system trained mostly on US English under-performs on Scottish, Indian, or Nigerian English, and on code-switching (mixing languages mid-sentence). The fix is data — deliberately diverse, multi-accent, multilingual corpora — and large weakly-supervised models like Whisper that have simply heard far more variety. This is also a fairness issue: a higher WER for some accent groups is a real, measured harm, not a rounding error.

Latency. For dictation you can wait for the full utterance; for a live assistant or live captions you cannot. Streaming ASR must emit words as the person speaks, which rules out models that need the whole clip (vanilla attention seq2seq) and favors CTC or the streaming Transducer (RNN-T) architecture, which decode left-to-right in real time. There is a genuine accuracy-vs-latency tradeoff: looking ahead a few hundred milliseconds gives the model more context and improves accuracy, but adds delay, so engineers tune that lookahead window per product.
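
To make the lookahead tradeoff concrete, here is a rough sketch of chunked streaming over feature frames. It illustrates the bookkeeping only and is not how any particular CTC or RNN-T toolkit is implemented: the names `stream_chunks` and `algorithmic_latency_ms`, the 10 ms frame hop, and the chunk sizes are all assumptions chosen for illustration.

```python
def stream_chunks(num_frames, chunk, lookahead):
    """Yield (start, end, visible_end) frame ranges for chunked streaming.

    Frames [start, end) are decoded in this step; the model may also peek at
    frames up to visible_end (the lookahead) before emitting output.
    """
    for start in range(0, num_frames, chunk):
        end = min(start + chunk, num_frames)
        visible_end = min(end + lookahead, num_frames)
        yield start, end, visible_end

def algorithmic_latency_ms(chunk, lookahead, hop_ms=10.0):
    """Approximate worst-case wait for the first frame of a chunk, ignoring compute time."""
    return (chunk + lookahead) * hop_ms

for start, end, visible_end in stream_chunks(num_frames=100, chunk=32, lookahead=16):
    print(f"decode frames {start}-{end - 1}, seeing up to frame {visible_end - 1}")

print(algorithmic_latency_ms(chunk=32, lookahead=16))   # delay with lookahead
print(algorithmic_latency_ms(chunk=32, lookahead=0))    # delay without it
```

Under these assumed numbers, 16 frames of lookahead add 160 ms of delay on top of the 320 ms chunk. Whether the extra context is worth that delay is the per-product tuning decision described above.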

flowchart TD
  N["Noise and reverb"] --> N1["SpecAugment + noise aug<br/>denoising, beamforming"]
  A["Accents and dialects"] --> A1["Diverse multilingual data<br/>large weak-supervision models"]
  L["Latency"] --> L1["Streaming CTC / RNN-T<br/>tune lookahead window"]

Warning

Common mistake: validating only on read, native-accent, close-mic speech, then being shocked by field performance. Build a test set that mirrors deployment — far-field, noisy, accented, spontaneous — before you ship, and track WER per accent and per noise condition, not just the average.
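
Tracking WER per condition needs little more than a word-level edit distance and a group label on each test utterance. The sketch below is a minimal pure-Python version under simple assumptions (whitespace tokenization, no text normalization); the function names and the four sample rows are hypothetical and invented purely for illustration.

```python
from collections import defaultdict

def word_errors(reference, hypothesis):
    """Return (edit_distance, num_reference_words) at the word level."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edits needed to turn the ref words seen so far into the first j hyp words
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1], len(ref)

def wer_by_group(samples):
    """samples: iterable of (group, reference, hypothesis). Returns {group: WER}."""
    errors, words = defaultdict(int), defaultdict(int)
    for group, ref, hyp in samples:
        e, n = word_errors(ref, hyp)
        errors[group] += e
        words[group] += n
    return {g: errors[g] / words[g] for g in words}

# Hypothetical evaluation rows, invented for illustration only.
samples = [
    ("close-mic", "turn on the lights", "turn on the lights"),
    ("close-mic", "play some music", "play some music"),
    ("far-field", "turn on the lights", "turn the light"),
    ("far-field", "play some music", "play sun music"),
]
print(wer_by_group(samples))
```

In this toy data the pooled WER would look moderate, while the breakdown shows that every error comes from the far-field group. The same grouping key can just as well be an accent label or an SNR bucket.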

21.9 — Quick reference

Term / formula	What it means	When / why it matters
Sampling rate $f_s$	Samples captured per second (16 kHz speech, 44.1 kHz music)	Sets fidelity and data size; pick by the content’s frequency range
Nyquist $f_{\max}=f_s/2$	Highest faithfully recordable frequency	Sample $\ge 2\times$ the top frequency or it aliases into a fake tone
STFT → spectrogram	Fourier transform over short sliding windows → time×freq image	The central move; turns audio into a 2-D image for vision-style models
Mel scale $m=2595\log_{10}(1+f/700)$	Warps hertz into perceived-pitch units	Matches human hearing; gives the mel-spectrogram models consume
Log-mel-spectrogram	Mel energies in decibels (log loudness)	The standard input to almost every modern audio model
MFCCs	DCT of log-mel → ~13 decorrelated coeffs/frame	Classic feature for HMM/GMM systems; neural nets skip the DCT
CTC	Alignment-free loss summing over blank/repeat paths	End-to-end ASR with no per-frame labels; good for streaming
WER $=(S+I+D)/N$	Word error rate vs. a human reference	The ASR metric — always measure on realistic, noisy, accented audio
Tacotron 2 + vocoder	Text → mel-spectrogram → waveform	The neural-TTS recipe; mel-spec is the ASR/TTS interlingua
WaveNet $p(\mathbf{x})=\prod_t p(x_t\mid x_{<t})$	Autoregressive sample-by-sample waveform model	High fidelity but slow; HiFi-GAN/VITS parallelize for real time
Speaker embedding (x-vector)	Fixed vector capturing a voice’s timbre	Powers verification, identification, and diarization
Cosine similarity	Angle between two embeddings, magnitude-invariant	The accept/reject score for speaker verification
VAD	Frame-level speech-present classifier	Trims silence before costly embedding; detects end-of-turn
Keyword spotting (KWS)	Tiny always-on wake-word detector	Cheap on-device gatekeeper before heavy cloud ASR
wav2vec 2.0 / HuBERT	Self-supervised masked-prediction pretraining	One reusable encoder; fine-tune with minutes of labels
RVQ $\mathbf{z}\approx\sum_i \mathbf{e}_i$	Stacked codebooks each correcting the last’s residual	Turns audio into discrete tokens for codecs and audio LLMs
SNR$_{\text{dB}}=10\log_{10}(P_s/P_n)$	Speech power vs. noise power, in dB	Quantifies noise; WER climbs steeply near 0 dB
SpecAugment	Mask random time/freq bands of the spectrogram	Cheap augmentation that boosts noise robustness
Streaming CTC / RNN-T	Decode left-to-right as audio arrives	Required for live captions; tune lookahead for accuracy vs. latency

21.10 — Key takeaways

Audio starts as a 1-D waveform; the field’s central move is the STFT → spectrogram → mel-spectrogram (log) → MFCCs chain that turns it into a 2-D time-frequency image. Modern models eat log-mel-spectrograms.
ASR evolved from a three-part pipeline (acoustic HMM + lexicon + LM) to end-to-end CTC (alignment-free via a blank token), attention seq2seq, and large weakly-supervised Whisper-style Transformers. Measure quality with WER on realistic audio.
TTS went concatenative → parametric → neural. The neural recipe is text → mel-spectrogram (Tacotron 2) → waveform (WaveNet / HiFi-GAN); the mel-spectrogram is the shared interlingua between ASR and TTS.
Speaker embeddings (x-vectors) power verification (1-to-1), identification (1-to-N), and diarization (who-spoke-when via clustering), with cosine similarity doing the comparison and VAD trimming silence first.
Non-speech audio — classification, MIR, keyword spotting — is largely a CNN/Transformer over a spectrogram; KWS is a tiny always-on gatekeeper in front of the heavy ASR stage.
Self-supervised learning (wav2vec 2.0, HuBERT) pretrains a reusable speech encoder on unlabeled audio via masked prediction, so downstream ASR/speaker/emotion tasks need only a fraction of the labels — the foundation-model shift, arrived in audio.
Neural audio codecs (SoundStream, EnCodec) with residual vector quantization turn continuous sound into discrete audio tokens, making audio generatable by a Transformer (AudioLM, VALL-E, MusicGen) and giving extreme compression.
The deployment challenges are noise/reverb (augmentation, SpecAugment, beamforming; gauged by SNR), accents (diverse data; a fairness concern), and latency (streaming CTC/RNN-T with a tunable lookahead).

21.11 — See also

Recurrent & Sequence Models — RNNs, LSTMs, and the sequence-modeling backbone of older ASR/TTS.
Attention & Transformers — the encoder–decoder and attention mechanism behind Whisper, Tacotron, and modern audio models.
Convolutional Neural Networks — the CNNs reused on spectrograms for audio classification.
Generative Models — the GAN, autoencoder, and autoregressive foundations behind neural vocoders (HiFi-GAN, WaveNet) and audio codecs (EnCodec).
Transfer & Self-Supervised Learning — the masked-prediction pretraining behind wav2vec 2.0 and HuBERT.
Natural Language Processing — the language models and tokenization that pair with ASR output, TTS input, and discrete audio tokens.
Clustering & Unsupervised Learning — the clustering step at the heart of speaker diarization (and HuBERT’s pseudo-labels).
Recommender Systems — where music-information-retrieval features feed recommendation.
AI Ethics, Fairness & Safety — the accent-bias and fairness dimension of deployed speech systems.

↪ The thread continues → Chapter 22 · ⏳ Time Series & Forecasting

Audio is one signal that unfolds in time; the broader discipline of forecasting the future from the past — demand, sensors, markets — is time-series analysis.
