Your AI doesn’t pick the best word. Every time ChatGPT, Claude, or Gemini generates a response, it’s rolling a weighted die over tens of thousands of possible tokens — and the math behind that roll determines everything from creative flair to catastrophic hallucination. Understanding how token sampling works isn’t just academic curiosity; it’s the single most important lever developers have for controlling AI behavior in production.
- Temperature divides every logit by T before softmax: GPT-3’s default of 0.7 sharpens the distribution toward its highest-probability tokens, while temperature 2.0 square-roots probability ratios, making rare tokens orders of magnitude more competitive
- Top-k sampling with k=50, introduced by Fan et al. in 2018, cuts off all but the 50 most probable tokens before sampling, but fails catastrophically when the true distribution has fewer than 50 viable options
- Nucleus sampling (top-p) with p=0.9, published by Holtzman et al. in 2019, dynamically adjusts the candidate pool from as few as 1 token to over 1000 tokens depending on probability distribution shape
Most people think setting temperature to 0 makes an LLM “deterministic.” That’s only half true — and the other half involves repetition penalties, renormalization math, and a sampling trick called Gumbel-max that most developers have never heard of. Let’s break down the actual pipeline, step by step, the way it runs inside every major model serving framework from vLLM to TensorRT-LLM.
Step 1: Raw Logits — The Transformer’s Final Output
Before any sampling happens, the transformer’s final layer outputs raw logits — unnormalized scores — for every token in the model’s vocabulary. GPT-3’s vocabulary contains 50,257 tokens; GPT-4 and Claude use similarly large vocabularies (Claude’s is around 100,000 with Byte-Pair Encoding). These logits typically range from -15 to +15, and they represent how “compatible” each token is with the context so far.
A logit of +12 for “apple” doesn’t mean “apple” has a 12% chance. It means “apple” is very compatible relative to everything else. The actual probability depends on how all 50,000+ logits compare to each other — which is where softmax comes in later.
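To make that concrete, here is a toy NumPy sketch (the numbers are illustrative, not from any real model): the same +12 logit for “apple” yields a near-certain probability in one context and barely more than a coin flip among three in another, because probability depends on the whole logit vector.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())   # subtract the max for numerical stability
    return e / e.sum()

# Same +12 logit for "apple", two different contexts for the rest
confident = np.array([12.0, 2.0, 1.0])    # "apple" towers over the alternatives
contested = np.array([12.0, 11.9, 11.8])  # nearly a three-way tie

print(softmax(confident)[0])   # ~1.0
print(softmax(contested)[0])   # ~0.37
```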
Step 2: Repetition Penalty — Suppressing the Loop
Before probabilities are calculated, most production systems apply a repetition penalty. This divides the logits of tokens that already appear in the generated sequence by a penalty factor — commonly 1.1 to 1.3.
Repetition penalty with alpha=1.2, as implemented in Hugging Face Transformers, divides a previously generated token’s logit by 1.2 when it is positive (and multiplies it by 1.2 when it is negative, so the score always moves away from selection), sharply reducing “the the the” loops in greedy decoding. This is critical because without it, autoregressive models have a strong tendency to repeat recently generated tokens — a phenomenon researchers call “degeneration.”
The key subtlety: repetition penalty applies before softmax. A logit of 8.0 for a repeated token becomes roughly 6.67 after division by 1.2. That 1.33 difference might seem small, but after softmax exponentiation it shrinks the token’s unnormalized weight by a factor of e^1.33 ≈ 3.8 — enough to drop its probability from 15% to around 4%.
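A minimal sketch of that divide-or-multiply rule (mirroring the convention used by Hugging Face’s repetition-penalty processor, with toy logits):

```python
import numpy as np

def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Penalize tokens that already appear in the generated sequence.

    Positive logits are divided by the penalty and negative logits are
    multiplied by it, so the score always moves away from selection.
    """
    logits = logits.copy()
    for tok in set(generated_ids):
        if logits[tok] > 0:
            logits[tok] /= penalty
        else:
            logits[tok] *= penalty
    return logits

logits = np.array([8.0, 2.0, -1.0])
penalized = apply_repetition_penalty(logits, generated_ids=[0, 2])
# token 0: 8.0 / 1.2 ≈ 6.67; token 2: -1.0 * 1.2 = -1.2; token 1 untouched
```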
Step 3: Temperature — The Distribution Shaper
Temperature is the most widely misunderstood parameter in all of AI. Here’s what it actually does mathematically: it divides every logit by the temperature parameter T before exponentiation. That’s it. But the effects are profound.
The softmax temperature formula divides logits by T before exponentiation, meaning temperature 0.5 squares the probability ratios while temperature 2.0 takes their square root, geometrically reshaping the entire distribution.
At temperature 0 (greedy decoding), the softmax limit is infinitely sharp: the single highest-scoring token always wins. At temperature 1, you get the model’s raw probability distribution. At temperature 2, probability ratios are square-rooted: a token 10,000 times less likely than the top choice becomes only 100 times less likely — which is why temperature 2.0 outputs often read like fever dreams.
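The reshaping is easy to see numerically. A toy sketch with three made-up logits: lower temperature concentrates mass on the top token, higher temperature spreads it toward the tail.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())   # subtract the max for numerical stability
    return e / e.sum()

logits = np.array([4.0, 2.0, 0.0])

# Divide logits by T, then softmax: the whole pipeline's temperature step
probs_by_temp = {T: softmax(logits / T) for T in (0.5, 1.0, 2.0)}
for T, probs in probs_by_temp.items():
    print(f"T={T}: {probs.round(3)}")
```

At T=0.5 the top token takes roughly 98% of the mass; at T=2.0 it falls to about 66% while the weakest token grows from under 2% to roughly 9%.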
Step 4: Softmax — Converting Scores to Probabilities
After temperature scaling, the softmax function converts logits into a proper probability distribution:
P(token_i) = e^(logit_i / T) / Σ e^(logit_j / T) for all tokens j
This ensures all probabilities sum to exactly 1.0. The softmax function is where the exponential nature of the distribution becomes visible — a logit difference of just 2.0 between two tokens can translate to a 7:1 probability ratio. Small logit differences compound dramatically after exponentiation.
In practice, numerical stability requires subtracting the maximum logit before computing softmax (the “log-sum-exp trick”), otherwise the exponentials overflow for logit values above ~88. Every production inference engine implements this.
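A sketch of why the max-subtraction matters (the logit values here are deliberately extreme to force the failure):

```python
import numpy as np

logits = np.array([800.0, 799.0])

# Naive softmax overflows: exp(800) is inf even in float64, and inf/inf is nan
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(logits) / np.exp(logits).sum()   # [nan, nan]

# Stable softmax: subtracting the max cancels in the ratio, so the result
# is mathematically identical, but every exponent is now <= 0
shifted = logits - logits.max()
stable = np.exp(shifted) / np.exp(shifted).sum()    # well-defined probabilities
```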
Step 5: Sorting — Preparing the Candidate Pool
With probabilities computed, the system sorts all tokens by probability in descending order. This isn’t just for convenience — it’s essential for the filtering steps that follow. Both top-k and top-p require knowing which tokens are most probable.
This sort operation over 50,000+ tokens is one reason why sampling is not free. Modern frameworks like vLLM use partial sorts (only finding the top-k elements rather than fully sorting) to reduce this cost. For k=50, a partial heap sort is O(n log k) instead of O(n log n).
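The saving is easy to demonstrate with NumPy’s `argpartition` (which uses introselect, average O(n), rather than a heap, but illustrates the same point: selecting the top k does not require a total ordering):

```python
import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=50_257)   # one score per vocabulary token
k = 50

# Full sort: O(n log n), wasteful when only the top k matter
full_top = np.sort(logits)[-k:]

# Partial selection: find the k largest without ordering the rest,
# then sort just those k survivors, which is only O(k log k)
part_idx = np.argpartition(logits, -k)[-k:]
partial_top = np.sort(logits[part_idx])
```

Both approaches recover exactly the same 50 values; only the work done on the other 50,207 tokens differs.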
Step 6: Filtering — Top-k vs Top-p (Nucleus Sampling)
This is where the two main sampling strategies diverge, and understanding the difference is crucial for anyone building with LLMs.
Top-k sampling keeps only the k highest-probability tokens and discards everything else. Top-k sampling with k=50, introduced by Fan et al. in 2018, cuts off all but the 50 most probable tokens before sampling, but fails catastrophically when the true distribution has fewer than 50 viable options — in those cases, it forces the model to consider tokens it has essentially zero confidence in.
Top-p (nucleus) sampling takes a smarter approach. Nucleus sampling (top-p) with p=0.9, published by Holtzman et al. in February 2019, dynamically adjusts the candidate pool from as few as 1 token to over 1000 tokens depending on probability distribution shape. When the model is very confident (a sharp distribution), top-p 0.9 might only need 2-3 tokens. When the model is uncertain (a flat distribution), it might need hundreds.
This adaptivity is why top-p has largely replaced top-k in production systems. OpenAI’s API, Anthropic’s API, and most open-source serving frameworks default to top-p. The practical rule: use top-p=0.9 as your starting point, and only adjust top-k if you need a hard ceiling on compute cost per token.
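Steps 6 and 7 together can be sketched in a few lines of NumPy (a simplified illustration, not any framework’s actual kernel). Note how the same p=0.9 keeps one token for a sharp distribution and all four for a flat one:

```python
import numpy as np

def top_p_filter(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability
    reaches p, then renormalize the survivors to sum to 1."""
    order = np.argsort(probs)[::-1]          # sort descending by probability
    csum = np.cumsum(probs[order])
    cutoff = np.searchsorted(csum, p) + 1    # include the token that crosses p
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()         # renormalization (Step 7)

sharp = np.array([0.95, 0.03, 0.01, 0.01])   # confident model
flat = np.array([0.25, 0.25, 0.25, 0.25])    # uncertain model

print(np.count_nonzero(top_p_filter(sharp)))   # 1 token survives
print(np.count_nonzero(top_p_filter(flat)))    # all 4 survive
```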
Step 7: Renormalization — Making Probabilities Sum to One Again
After filtering discards the low-probability tail, the remaining tokens’ probabilities no longer sum to 1.0. Renormalization fixes this by dividing each remaining probability by the sum of all remaining probabilities.
If top-p 0.9 kept tokens summing to 0.92, each token’s probability gets divided by 0.92 — slightly boosting every surviving token. This seems trivial, but it’s essential: without renormalization, the sampling distribution is invalid and the model’s behavior becomes unpredictable.
In edge cases, renormalization can amplify problematic tokens. If filtering only removes 1% of the probability mass, renormalization barely changes anything. But if it removes 50% (aggressive top-k), the surviving tokens get doubled in probability — including tokens the model was already uncertain about.
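The amplification effect is just a division (toy numbers for illustration):

```python
import numpy as np

# After aggressive filtering, only half the original mass survives
surviving = np.array([0.30, 0.15, 0.05])      # sums to 0.5
renormed = surviving / surviving.sum()        # divide by 0.5: every survivor doubles
# renormed == [0.6, 0.3, 0.1]
```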
Step 8: The Final Draw — Sampling and Repeating
With a clean, renormalized distribution in hand, the system samples exactly one token. The most common method is inverse transform sampling: generate a uniform random number between 0 and 1, walk down the cumulative probability distribution, and pick the token where the cumulative sum exceeds the random value.
But there’s a faster trick: the Gumbel-max trick adds Gumbel noise to each logit and then simply takes the argmax. This produces the exact same distribution as inverse transform sampling but with a single pass through the vocabulary — no cumulative sums needed. Modern GPU kernels in frameworks like TensorRT-LLM and vLLM use variants of this for speed.
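The equivalence of the two draws can be checked empirically. A NumPy sketch (a simplified CPU illustration, not a production GPU kernel): both samplers converge to the same softmax distribution.

```python
import numpy as np

rng = np.random.default_rng(42)
logits = np.array([2.0, 1.0, 0.0])
probs = np.exp(logits) / np.exp(logits).sum()

n = 200_000

# Inverse transform sampling: uniform draw against the cumulative distribution
u = rng.random(n)
inv = np.searchsorted(np.cumsum(probs), u)
inv = np.minimum(inv, len(probs) - 1)   # guard against round-off at the CDF's top

# Gumbel-max: perturb each logit with Gumbel(0, 1) noise, then take the argmax
gumbel = -np.log(-np.log(rng.random((n, logits.size))))
gm = np.argmax(logits + gumbel, axis=1)

# Both empirical frequencies match the softmax probabilities
inv_freq = np.bincount(inv, minlength=3) / n
gm_freq = np.bincount(gm, minlength=3) / n
```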
Once the token is selected, it’s appended to the sequence and the entire process repeats: new logits, new temperature scaling, new softmax, new filter, new sample. Each generated token requires a complete forward pass through the transformer — which is why generation is inherently sequential and slow, while a prompt of the same length can be processed in a single parallel pass.
When Sampling Breaks — The Hidden Failure Mode
All of this elegant math has a failure mode that quietly wastes enormous amounts of API spend. When temperature is set too high and top-p too loose, the model enters a degenerate loop: it generates plausible-sounding but factually empty text, consuming tokens and money without producing value. The worst part? The output looks reasonable at a glance — it takes careful reading to realize the model has “gone off the rails.”
Production systems combat this with stop sequences, length limits, and logprobs monitoring. But the fundamental issue remains: the sampling pipeline is a probabilistic machine, and probability distributions have long tails. The more you open up the candidate pool, the more likely you are to visit the tail.
And that’s the real reason understanding this pipeline matters. Temperature, top-p, and repetition penalty aren’t just API knobs — they’re levers on a probability engine that determines whether your AI product delights users or burns through your budget generating nonsense.
Quick Reference: Optimal Settings for Common Use Cases
- Coding / factual Q&A: temperature 0.0–0.3, top-p 0.9 — minimize randomness, maximize accuracy
- Creative writing: temperature 0.7–1.0, top-p 0.95 — allow creative variation while staying coherent
- Brainstorming / ideation: temperature 1.0–1.5, top-p 0.95–0.99 — maximize diversity, accept some chaos
- Production chatbots: temperature 0.2–0.5, top-p 0.85, repetition_penalty 1.1 — balance consistency with natural variation
- Data extraction / classification: temperature 0.0, top-p 1.0 (disabled) — deterministic outputs only