How Whisper Transcribes 57 Languages at Once

HubAI Asia

Whisper hallucinates words that were never spoken—and that single flaw reveals more about how OpenAI’s speech model actually works than any marketing page ever will. Whisper doesn’t “listen” to audio the way humans do. It chops sound into mathematical fragments, compresses them through convolutional filters, and then plays a high-stakes game of predictive text across 57 languages simultaneously. The phantom words are a side effect of the same architecture that gives Whisper its remarkable accuracy. Here’s the step-by-step breakdown of what actually happens inside the model from the moment audio enters to the moment text appears.

Key Facts Most People Don’t Know

  • Whisper was trained on 680,000 hours of multilingual audio data scraped from the web, equivalent to 77.5 years of continuous listening
  • The model uses 99 language tokens but can actually transcribe 57 languages with word error rates below 50%, with English achieving just 5.4% WER
  • Whisper’s largest model contains 1,550 million parameters across 32 encoder and 32 decoder blocks, making it roughly 40x larger than the tiny model’s 39 million parameters

This isn’t a surface-level overview. We’re going through the actual signal processing, the encoder architecture, the decoder’s special token system, and the beam search that decides which words make it to your screen. If you’ve ever wondered why Whisper sometimes invents text during silence—or why it switches languages mid-sentence—the answers are built into these steps.

Step 1: Audio Is Resampled to 16,000 Hz and Cut Into 30-Second Windows

Regardless of what format your audio arrives in (44,100 Hz stereo WAV, 22,050 Hz MP3, or an 8,000 Hz phone-call recording), the first step is ruthless standardization. Every audio stream is downmixed to mono and resampled to exactly 16,000 Hz. This isn’t arbitrary: by the Nyquist theorem, a 16 kHz sample rate captures frequencies up to 8,000 Hz, which covers the bulk of the energy that makes speech intelligible. Content above that adds computation for little gain in recognition accuracy.

After resampling, the audio is padded or trimmed into 30-second windows, because Whisper only ever sees fixed 30-second chunks. If your clip is shorter than 30 seconds, it gets zero-padded (silence appended). If it’s longer, the audio is processed as a sequence of 30-second windows. In OpenAI’s reference implementation the windows are not cut at fixed offsets with a fixed overlap: each new window starts at the last timestamp the model predicted in the previous one, so a word that straddles a boundary is picked up again at the start of the next window. Some third-party pipelines do use fixed, overlapping chunks and merge the overlapping text afterward.

This chunked approach is also why Whisper can run on modest hardware. Instead of loading a 2-hour podcast into GPU memory all at once, it processes one 30-second slice at a time, stitching the results together at the end.
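The padding and truncation rule can be sketched in a few lines. This is a minimal illustration that uses plain Python lists instead of real audio buffers; the function name `pad_or_trim` and the constants are chosen for this example.

```python
SAMPLE_RATE = 16_000                      # Hz, after resampling
CHUNK_SECONDS = 30
N_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS   # 480,000 samples per window


def pad_or_trim(samples, length=N_SAMPLES):
    """Return exactly `length` samples: truncate long input, zero-pad short input."""
    if len(samples) >= length:
        return list(samples[:length])
    return list(samples) + [0.0] * (length - len(samples))


short_clip = [0.1] * (5 * SAMPLE_RATE)    # 5 seconds of audio
long_clip = [0.1] * (45 * SAMPLE_RATE)    # 45 seconds of audio

padded = pad_or_trim(short_clip)          # 5 s of signal + 25 s of zeros
trimmed = pad_or_trim(long_clip)          # first 30 s only
print(len(padded), len(trimmed))
```

Both results have the same length, which is what lets every window go through the model with an identical input shape.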

Step 2: Log-Mel Spectrogram Converts Audio Into a 2D Grid

The raw waveform—a 1D array of amplitude values—means almost nothing to a neural network. Whisper needs a representation that exposes the frequency content of speech over time, so it converts each 30-second window into a log-Mel spectrogram.

Here’s what that involves. A Short-Time Fourier Transform (STFT) is applied using 25-millisecond windows with a 10-millisecond stride (hop length). This produces a complex-valued spectrogram, from which the power (squared magnitude) is taken and then mapped onto the Mel scale, a perceptual frequency scale that mirrors how human hearing works (we’re better at distinguishing low frequencies than high ones). The Mel filterbank uses 80 channels (128 in the later large-v3 checkpoint), meaning each time step is represented by 80 frequency-band values.

The math works out to exactly 3,000 time steps per 30-second window (30,000 ms ÷ 10 ms stride = 3,000 frames). Each frame has 80 frequency values, so the spectrogram is a 3,000 × 80 matrix. This matrix is then log-scaled, with the Mel energies floored at a tiny positive value to avoid log(0), and rescaled to a fixed range to produce the final input tensor. The log compression mirrors the logarithmic nature of human loudness perception and keeps the dynamic range manageable for the neural network.
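The frame arithmetic is easy to check by hand. The sketch below recomputes it and adds two small helpers; the Mel formula shown is the common HTK-style variant, which is an assumption here (libraries differ in the exact variant they use), and the `1e-10` floor is likewise illustrative.

```python
import math

SAMPLE_RATE = 16_000
WINDOW_MS = 25
HOP_MS = 10
CHUNK_SECONDS = 30

window_samples = SAMPLE_RATE * WINDOW_MS // 1000        # samples per STFT window
hop_samples = SAMPLE_RATE * HOP_MS // 1000              # samples between frames
n_frames = SAMPLE_RATE * CHUNK_SECONDS // hop_samples   # frames per 30 s window


def hz_to_mel(freq_hz):
    """HTK-style Mel formula (assumed variant, for illustration only)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def log_compress(mel_energy, floor=1e-10):
    """Log-scale one Mel energy value, flooring it first to avoid log(0)."""
    return math.log10(max(mel_energy, floor))


print(window_samples, hop_samples, n_frames)
# The same 1,000 Hz step covers far fewer Mel units at the top of the band.
print(hz_to_mel(1000) - hz_to_mel(0), hz_to_mel(8000) - hz_to_mel(7000))
```

The second print line shows why the Mel scale is described as compressing high frequencies: equal steps in Hz shrink in Mel units as frequency rises.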

Step 3: A Two-Layer CNN Compresses the Spectrogram by 2×

Before the transformer sees anything, a compact two-layer 1D convolutional neural network processes the spectrogram along the time dimension. Each layer uses a kernel size of 3 and GELU activation (Gaussian Error Linear Unit, a smoother alternative to ReLU that’s become standard in transformer architectures). The first layer has a stride of 1 and the second a stride of 2.

Only the second, stride-2 convolution downsamples the time axis, by a factor of 2. After the two layers, the 3,000 time steps become 1,500 positions. The channel dimension expands from 80 input channels to the model’s hidden dimension: 384 for the tiny model, up to 1,280 for the large model.

This 2× compression serves a practical purpose: it reduces the sequence length that the transformer’s self-attention mechanism must process, cutting memory consumption and computation by roughly 4× (attention is quadratic in sequence length). The CNN learns to merge adjacent time frames intelligently, preserving the most salient acoustic features while discarding redundancy.
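The sequence lengths follow from the standard 1D convolution output-length formula. The sketch below assumes a kernel size of 3 with padding 1, a stride of 1 in the first layer and a stride of 2 in the second, which is the configuration that yields 1,500 positions; the helper name is made up for this example.

```python
def conv1d_out_length(length, kernel_size, stride, padding):
    """Output length of a 1D convolution (standard formula, dilation 1)."""
    return (length + 2 * padding - kernel_size) // stride + 1


frames = 3000
after_conv1 = conv1d_out_length(frames, kernel_size=3, stride=1, padding=1)
after_conv2 = conv1d_out_length(after_conv1, kernel_size=3, stride=2, padding=1)

# Self-attention cost grows with the square of the sequence length.
attention_saving = (frames ** 2) / (after_conv2 ** 2)

# For comparison: two stride-2 layers would shrink the sequence to 750.
two_stride2 = conv1d_out_length(
    conv1d_out_length(frames, 3, 2, 1), kernel_size=3, stride=2, padding=1
)
print(after_conv1, after_conv2, attention_saving, two_stride2)
```

Halving the sequence once is what produces the roughly 4× saving in attention cost; a second halving would give 750 positions rather than 1,500.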

Step 4: Sinusoidal Positional Embeddings and Encoder Self-Attention

After the CNN, sinusoidal positional embeddings are added to the 1,500 token representations. These fixed (non-learned) embeddings encode each token’s position in the sequence using sine and cosine functions at different frequencies, giving the model a sense of temporal order without requiring additional trainable parameters.
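A fixed sinusoidal table can be built without any trainable parameters. The sketch below follows the common "half sine, half cosine" layout with geometrically spaced timescales; the `max_timescale` of 10,000 and the deliberately tiny width of 8 channels are assumptions made to keep the example readable.

```python
import math


def sinusoids(length, channels, max_timescale=10_000):
    """Fixed sinusoidal position table with `length` rows and `channels` columns."""
    assert channels % 2 == 0, "channels must be even: half sine, half cosine"
    half = channels // 2
    log_increment = math.log(max_timescale) / (half - 1)
    inv_timescales = [math.exp(-log_increment * i) for i in range(half)]
    table = []
    for position in range(length):
        angles = [position * inv for inv in inv_timescales]
        row = [math.sin(a) for a in angles] + [math.cos(a) for a in angles]
        table.append(row)
    return table


positions = sinusoids(length=1500, channels=8)
print(len(positions), len(positions[0]))
print(positions[0])   # position 0: all sines are 0, all cosines are 1
```

Each row is added element-wise to the matching CNN output, so two frames with identical acoustics still receive different inputs if they occur at different times.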

The encoded representations then pass through the encoder’s multi-head self-attention blocks. Each block applies:

  • Layer normalization before the self-attention sublayer (Pre-LN architecture, which stabilizes training)
  • Multi-head self-attention—each token attends to all 1,499 other tokens, learning relationships between different time positions in the audio
  • Layer normalization before the feed-forward sublayer
  • A position-wise feed-forward network (two linear transformations with GELU activation in between)

The large model stacks 32 of these encoder blocks, each with 20 attention heads. The tiny model uses just 4 blocks with 6 heads. The encoder’s job is to transform the acoustic features into a rich hidden representation that encodes not just what sounds are present, but their linguistic relationships—phonemes, prosody, speaker identity, and ambient noise characteristics all get baked into these hidden states.

Step 5: The Decoder Receives Special Control Tokens

This is where Whisper diverges from a standard text-to-text transformer. The decoder doesn’t start generating text from a generic start token. Instead, it receives a carefully orchestrated sequence of special control tokens that configure its behavior:

  1. <|startoftranscript|>: signals the beginning of transcription
  2. <|language|>: one of 99 language tokens identifying the detected language (e.g., <|en|> for English, <|ja|> for Japanese, <|sw|> for Swahili)
  3. <|task|>: either <|transcribe|> (output in the same language) or <|translate|> (output in English regardless of input language)
  4. <|notimestamps|> or timestamp tokens: determines whether the output includes timing information for each transcribed segment

These tokens act like a configuration panel for the model. The language token is particularly clever: when no language is specified, Whisper detects it with the same decoder, feeding in <|startoftranscript|> and choosing the language token to which the model assigns the highest probability before committing to a transcription path. This is why the model has 99 language tokens but achieves usable WER on only 57 of them: for the remaining 42 languages, training data was too sparse for reliable performance.

The task token is what enables Whisper’s translation capability without any separate translation model. OpenAI deliberately included 125,000 hours of X→en translation data in the training set: audio in one language paired with English text. (A further 117,000 hours are non-English audio paired with transcripts in the same language.) This teaches the decoder to perform direct speech-to-text translation, converting non-English speech directly into English text without any intermediate transcription step.
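The control-token prefix can be sketched as plain string assembly. The token spellings are the ones listed above; the function names `detect_language` and `build_decoder_prompt` and the example log probabilities are hypothetical, and a real tokenizer would map these strings to integer IDs.

```python
def detect_language(language_log_probs):
    """Pick the language code whose token has the highest log probability."""
    return max(language_log_probs, key=language_log_probs.get)


def build_decoder_prompt(language="en", task="transcribe", timestamps=False):
    """Assemble the control-token prefix that conditions the decoder."""
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unknown task: {task!r}")
    prompt = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        prompt.append("<|notimestamps|>")
    return prompt


# Made-up scores for the language tokens after <|startoftranscript|>.
scores = {"en": -2.9, "ja": -0.2, "sw": -4.1}
language = detect_language(scores)

transcribe_prompt = build_decoder_prompt(language, "transcribe")
translate_prompt = build_decoder_prompt(language, "translate", timestamps=True)
print(transcribe_prompt)
print(translate_prompt)
```

Swapping a single token in the prefix is all it takes to switch the same weights from Japanese transcription to Japanese-to-English translation.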

Step 6: Cross-Attention Bridges Encoder and Decoder

While the decoder generates text token by token, it needs to stay grounded in the actual audio. That’s the job of cross-attention layers. Each decoder block contains both self-attention (attending to previously generated text tokens) and cross-attention (querying the encoder’s final hidden states).

The cross-attention mechanism works like this: the decoder’s query vectors come from the partially generated text, while the key and value vectors come from the encoder’s output. At each generation step, the decoder essentially asks, “Given what I’ve written so far and what the audio contains, what should the next word be?”

Critically, the decoder uses causal masking in its self-attention layers—it can only attend to tokens it has already generated, never future tokens. This prevents the model from “cheating” by looking ahead at its own output. The cross-attention, however, has full access to the encoder’s representation, allowing the decoder to look at any part of the 30-second audio window at any time.

This asymmetry is what makes the architecture work: the encoder sees the entire audio segment at once (no masking), while the decoder generates left-to-right, autoregressively, using cross-attention to stay anchored in the acoustic signal.
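The difference between masked self-attention and unmasked cross-attention shows up directly in the attention weights. The sketch below computes scaled dot-product weights for toy two-dimensional vectors; it omits the learned query, key, and value projections and the multiple heads of a real decoder block, and all values are invented.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(queries, keys, causal=False):
    """Scaled dot-product attention weights, one row per query."""
    scale = math.sqrt(len(keys[0]))
    rows = []
    for i, q in enumerate(queries):
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        if causal:
            # Mask future positions so token i only sees tokens 0..i.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        rows.append(softmax(scores))
    return rows


text_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]               # 3 generated tokens
audio_states = [[0.5, 0.1], [0.2, 0.9], [0.7, 0.3], [0.1, 0.4]]  # 4 encoder positions

self_weights = attention_weights(text_states, text_states, causal=True)
cross_weights = attention_weights(text_states, audio_states)

print(self_weights[0])    # first token can only attend to itself
print(cross_weights[0])   # but it can attend to every audio position
```

In the causal rows, every weight to the right of the diagonal is exactly zero, while each cross-attention row spreads nonzero weight over the whole audio window.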

Step 7: Beam Search Evaluates 5 Parallel Hypotheses

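Whisper doesn’t have to pick the most likely next token and move on. In the setup OpenAI describes for long-form transcription, which is also the default of the command-line tool, it uses beam search with a beam width of 5, maintaining five competing transcription hypotheses simultaneously. (Calling the Python API without a beam size falls back to greedy decoding.) At each step, the model generates candidate tokens for all five beams, scores them, and keeps the top 5 continuations.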

The scoring function isn’t just raw log probability. Whisper applies a length normalization penalty that divides the total log probability by the number of tokens generated. Without this, beam search would systematically favor shorter transcriptions (fewer tokens = fewer opportunities for the probability to decrease). The normalization ensures that a fluent 12-word hypothesis isn’t beaten by a truncated 5-word one just because it’s shorter.

After all tokens are generated (or the <|endoftext|> token is reached), the beam with the highest normalized score becomes the final output. This beam search process contributes to Whisper’s characteristic coherence: single-token greedy decoding often produces locally plausible but globally inconsistent text, while beam search explores several alternatives and usually finds a higher-probability sequence, though it remains an approximation rather than a guaranteed global optimum.
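A toy example makes the difference from greedy decoding concrete. In the sketch below, `toy_next_log_probs` is a hypothetical stand-in for the decoder with a hand-written probability table, and `<eot>` stands in for the end-of-text token; the search itself is a simplified beam search with length-normalized final scoring.

```python
import math

EOT = "<eot>"  # stand-in for the end-of-text token


def toy_next_log_probs(prefix):
    """Hypothetical decoder stub: next-token log probabilities for a prefix."""
    table = {
        (): {"the": 0.6, "a": 0.4},
        ("the",): {"cat": 0.55, "dog": 0.45},
        ("a",): {"cat": 0.9, "dog": 0.1},
    }
    probs = table.get(tuple(prefix), {EOT: 1.0})
    return {token: math.log(p) for token, p in probs.items()}


def beam_search(next_log_probs, beam_size=5, max_len=10):
    beams = [([], 0.0)]   # (tokens, summed log probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for token, log_prob in next_log_probs(tokens).items():
                candidates.append((tokens + [token], score + log_prob))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == EOT:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    pool = finished or beams
    # Length normalization: rank by average log probability per token.
    return max(pool, key=lambda c: c[1] / len(c[0]))


greedy_tokens, greedy_score = beam_search(toy_next_log_probs, beam_size=1)
beam_tokens, beam_score = beam_search(toy_next_log_probs, beam_size=2)
print(greedy_tokens, round(math.exp(greedy_score), 2))
print(beam_tokens, round(math.exp(beam_score), 2))
```

With a beam of 1 the search commits to "the" because it is the likeliest first word, and ends with a total probability of 0.33. With a beam of 2 it keeps "a" alive long enough to discover that "a cat" is more probable overall, at 0.36.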

Window boundaries become important during stitching. In the reference implementation, each window starts where the last predicted timestamp of the previous window left off, and the text already produced is passed to the decoder as context. Pipelines that use fixed, overlapping chunks must instead reconcile the text that both windows produce for the overlapping audio. Either way, text can occasionally shift or duplicate at boundaries, which is one of the known limitations of the chunked approach.

Step 8: Decoding With Repetition Suppression

Decoding starts at a temperature of 0, which makes it deterministic: the model always follows its highest-scoring candidates and nothing is sampled at random. On top of this, Whisper adds a critical safety mechanism against repetition.

Autoregressive language models have a well-known failure mode: a degenerate loop that repeats the same phrase over and over. Whisper checks each decoded segment for two warning signs. The first is the compression ratio of the output text (how much smaller it gets when compressed, which spikes during repetition); the reference implementation flags ratios above 2.4. The second is the average log probability of the generated tokens, a measure of the model’s confidence, which is flagged when it falls below -1.0.

When either check fails, Whisper takes corrective action: it decodes the segment again at a higher temperature, stepping from 0 up to 1.0 in increments of 0.2, so that random sampling can break the loop. The reference implementation does not apply a separate repetition penalty, although some third-party implementations offer one. A segment can also be dropped entirely: if the model’s no-speech probability is high and its average log probability is low, the segment is treated as silence and produces no text.
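The detect-and-retry loop can be sketched with the standard library alone. The thresholds and temperature schedule below mirror the defaults of OpenAI’s open-source implementation as far as we know them and should be treated as assumptions; `fake_decode` is a hypothetical stub that stands in for a real decoding pass.

```python
import zlib

TEMPERATURES = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0)
COMPRESSION_RATIO_THRESHOLD = 2.4   # above this, the text is suspiciously repetitive
LOGPROB_THRESHOLD = -1.0            # below this, the model is too unsure


def compression_ratio(text):
    """Raw size over zlib-compressed size; repetitive text scores high."""
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw))


def decode_with_fallback(decode_once):
    """Retry at rising temperatures until the output passes both checks."""
    result = None
    for temperature in TEMPERATURES:
        text, avg_logprob = decode_once(temperature)
        result = (text, avg_logprob, temperature)
        too_repetitive = compression_ratio(text) > COMPRESSION_RATIO_THRESHOLD
        too_uncertain = avg_logprob < LOGPROB_THRESHOLD
        if not (too_repetitive or too_uncertain):
            break
    return result


def fake_decode(temperature):
    """Hypothetical decoder stub: loops at temperature 0, recovers above it."""
    if temperature == 0.0:
        return "thank you for watching " * 20, -0.3
    return "thanks for joining the call today", -0.4


text, avg_logprob, used_temperature = decode_with_fallback(fake_decode)
print(used_temperature, text)
```

Note that the looping output has a perfectly healthy average log probability of -0.3; only the compression ratio exposes it. That is why both signals are checked, and also why a fluent, non-repetitive hallucination can pass both.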

This is directly connected to the hallucination problem. When the audio contains silence or very low-amplitude noise, the encoder produces weak, ambiguous representations. The decoder, starved of clear acoustic signal, defaults to generating text that looks linguistically plausible but has no connection to the actual audio. The repetition suppressor catches the most extreme cases (looping), but subtle hallucinations—plausible-sounding sentences that were never spoken—can slip through because they don’t trigger the repetition detector.

The Five Model Sizes and Their Tradeoffs

Whisper comes in five sizes, each with dramatically different capabilities and resource requirements:

  • Tiny: 39M parameters, ~32× the speed of the large model, ~15% English WER; usable for real-time captioning on mobile
  • Base: 74M parameters, ~16× the speed of large, ~10% English WER; good balance for embedded devices
  • Small: 244M parameters, ~6× the speed of large, ~7% English WER; the practical minimum for most applications
  • Medium: 769M parameters, ~2× the speed of large, ~6% English WER; solid accuracy, still GPU-friendly
  • Large (v3): 1,550M parameters, 1× (the speed baseline), ~5.4% English WER; maximum accuracy, requires significant GPU memory

The large model’s 1,550 million parameters are spread across 32 encoder and 32 decoder blocks, making it roughly 40× larger than the tiny model. By the figures above, the biggest accuracy gain comes from tiny→small (about 8 percentage points of WER), while medium→large adds only ~0.6 points. For most production use cases, the small or medium model hits the sweet spot between accuracy and speed.
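Dividing the parameter counts listed above gives the size ratios directly. The short sketch below uses only those published counts, in millions.

```python
PARAMS_MILLIONS = {
    "tiny": 39,
    "base": 74,
    "small": 244,
    "medium": 769,
    "large": 1550,
}

ratios_to_tiny = {
    name: round(count / PARAMS_MILLIONS["tiny"], 1)
    for name, count in PARAMS_MILLIONS.items()
}
print(ratios_to_tiny)
```

Each step up the ladder roughly doubles or triples the parameter count, and the large model ends up close to 40 times the size of tiny.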

Why This Pipeline Matters

Understanding Whisper’s architecture isn’t just academic: it explains every quirk you’ve probably encountered. The 30-second windowing explains why transcriptions sometimes shift at boundaries. The log-Mel spectrogram explains why Whisper struggles with very high-pitched sounds (the Mel scale compresses high frequencies, and 16 kHz sampling discards everything above 8,000 Hz). The special token system explains why you can force translation by setting the task token, or pin the language by setting the language token. The beam search explains why Whisper’s output is unusually coherent for a speech model. And the repetition suppression explains both why Whisper rarely loops and why it sometimes hallucinates instead of looping.

The model isn’t “listening” in any human sense. It’s performing a precisely engineered sequence of signal transformations, attention computations, and probabilistic text generation. The fact that this pipeline achieves 5.4% WER on English—approaching human-level performance on many benchmarks—is a testament to how far the encoder-decoder architecture can go when trained on enough data.

But what triggers those phantom words Whisper invents when audio goes silent?
