Temperature, Top-K and Top-P Sampling in LLMs

Sampling techniques control how language models choose the next word during text generation. The model assigns probabilities to possible words and sampling determines which one is picked. By adjusting these methods, you can balance creativity and accuracy in generated responses.

Temperature controls randomness in predictions
Top-K limits choices to the most probable tokens
Top-P selects tokens based on cumulative probability
Together, these settings are used to tune output diversity and coherence

Temperature Sampling in LLMs

Temperature controls how random the model’s output is and typically ranges from 0 to 2.

Low temperature: Safer, more predictable text
High temperature: More creative and varied text

Figure: at low temperature the model has one clear choice (car); at high temperature it has many possible choices.

How it works

Before choosing the next word, the model divides its raw scores (logits) by the temperature and then converts them into probabilities with softmax.

Low temperature:

Strongly favours high-probability words
Produces stable and predictable text

High temperature:

Flattens the probability distribution
Allows less likely words to appear more often
Increases creativity but may reduce accuracy
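
The effect described above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the usual formulation in which each raw score (logit) is divided by the temperature before softmax. The function name `softmax_with_temperature` and the example scores are hypothetical and chosen only for illustration.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum for numerical stability; the probabilities are unchanged.
    highest = max(scaled)
    exps = [math.exp(x - highest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for four candidate next words.
logits = [4.0, 2.0, 1.0, 0.5]

for t in (0.2, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

With these scores, the top word takes almost all of the probability at temperature 0.2, while at 1.5 the lower-ranked words gain a visibly larger share.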

Example

Temperature = 0.2: focused and predictable, low creativity
Temperature = 1.0: balanced output
Temperature = 1.5: creative but less reliable

Implementation

Loads a pre-trained GPT-2 tokenizer and language model
Takes a text prompt and converts it into tokens
Generates text by predicting the next words step by step
Samples with temperature 0.7 to control creativity and limits the total length (prompt plus generated text) to 50 tokens
Converts the output back to readable text and prints it

Python

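from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the pre-trained GPT-2 tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Convert the prompt into token IDs (PyTorch tensors)
inputs = tokenizer("Explain AI in simple terms", return_tensors="pt")

output = model.generate(
    **inputs,
    max_length=50,      # total length in tokens, prompt included
    do_sample=True,     # sample instead of always taking the top token
    temperature=0.7,    # values below 1 sharpen the distribution
    top_k=0,            # turn off top-k filtering so only temperature applies
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token of its own
)

# Convert the generated token IDs back to readable text
print(tokenizer.decode(output[0], skip_special_tokens=True))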

Output:

Explain AI in simple terms
It's not too hard to learn to use artificial intelligence. The problem is that what you see here is not a single AI system. You'll see that the AI systems are different from the ones we've seen

Advantages

Allows control over the balance between accuracy and creativity
Produces reliable outputs when accuracy is needed
Encourages diverse ideas when creativity is preferred

Top-K Sampling in LLMs

Top-K limits the model to choosing the next word from only the K most likely options, ignoring all other possibilities. This helps control randomness by keeping the selection focused on higher-probability words.

Only the top K tokens are considered; one is sampled from them.

How it works

The model ranks all possible next words by probability
Keeps only the top K most likely words
Samples one word from this limited set in proportion to the remaining probabilities
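
The three steps above can be sketched in plain Python. This is a minimal illustration and not the code a library such as transformers runs internally. The function name `top_k_sample`, the vocabulary, and the probabilities are hypothetical.

```python
import random

def top_k_sample(probs, k, rng=random):
    """Sample an index from the k most probable entries of probs."""
    if k <= 0:
        raise ValueError("k must be positive")
    # Rank token indices by probability, highest first, and keep the top k.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = ranked[:k]
    # Sample in proportion to the kept probabilities;
    # random.choices treats the weights as relative, so no manual renormalizing is needed.
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]

# Hypothetical next-word distribution.
vocab = ["car", "bus", "bike", "train", "boat"]
probs = [0.50, 0.25, 0.12, 0.08, 0.05]

rng = random.Random(0)  # fixed seed so repeated runs give the same samples
samples = [vocab[top_k_sample(probs, k=3, rng=rng)] for _ in range(10)]
print(samples)
```

With k=3, every sampled word comes from the three most likely candidates; "train" and "boat" can never be chosen.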

Example

Top-K = 50: selects from the 50 most likely words
Smaller K: safer output
Larger K: more variety

Implementation

Loads a pre-trained GPT-2 tokenizer and model
Converts the input text into tokens
Generates text by predicting the next words
Uses Top-K sampling (K = 50) to limit word choices and reduce unlikely outputs
Decodes and prints the generated text

Python

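from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the pre-trained GPT-2 tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Convert the prompt into token IDs (PyTorch tensors)
inputs = tokenizer("Explain AI in simple terms", return_tensors="pt")

output = model.generate(
    **inputs,
    max_length=50,      # total length in tokens, prompt included
    do_sample=True,     # sample instead of always taking the top token
    top_k=50,           # keep only the 50 most likely tokens at each step
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token of its own
)

# Convert the generated token IDs back to readable text
print(tokenizer.decode(output[0], skip_special_tokens=True))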

Output:

Explain AI in simple terms.
You may find yourself with the following problems. You have a basic understanding of your current behaviour. If you've tried it out, you'll find that when you go through the troubleshooting, every problem is

Advantages

Removes very unlikely words from consideration
Reduces strange or incorrect outputs
Helps produce cleaner and more reliable text

Top-P (Nucleus) Sampling in LLMs

Top-P selects the next word based on cumulative probability instead of a fixed number of options, allowing the set of possible choices to grow or shrink depending on how confident the model is.

How it works

Words are ranked by their probability
Starting from the most likely word, words are added one by one
The model stops when the combined probability reaches or exceeds P
One word is then sampled from this group in proportion to its probability
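
The steps above can be sketched in plain Python. This is a minimal illustration and not the code a library such as transformers runs internally. The function names `top_p_candidates` and `top_p_sample`, the vocabulary, and the probabilities are hypothetical.

```python
import random

def top_p_candidates(probs, p):
    """Return indices of the smallest top-ranked set whose probabilities sum to at least p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break  # the nucleus is complete
    return kept

def top_p_sample(probs, p, rng=random):
    """Sample one index from the nucleus, weighted by probability."""
    kept = top_p_candidates(probs, p)
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]

# Hypothetical next-word distribution.
vocab = ["car", "bus", "bike", "train", "boat"]
probs = [0.50, 0.25, 0.12, 0.08, 0.05]

print([vocab[i] for i in top_p_candidates(probs, 0.9)])
print(vocab[top_p_sample(probs, 0.9, rng=random.Random(0))])
```

Here the first three words cover only 87% of the probability, so a fourth word is added to reach the 0.9 threshold and the least likely word is left out.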

Example

Top-P = 0.9: selects from the smallest set of top-ranked words that together account for at least 90% of the total probability
Lower P: fewer choices, safer output
Higher P: more choices, more creative output

Implementation

Loads a pre-trained GPT-2 tokenizer and model
Converts the input text into tokens
Generates text by predicting the next words
Uses Top-P sampling (P = 0.9) to select from words covering 90% probability
Decodes and prints the generated text

Python

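from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the pre-trained GPT-2 tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Convert the prompt into token IDs (PyTorch tensors)
inputs = tokenizer("Explain AI in simple terms", return_tensors="pt")

output = model.generate(
    **inputs,
    max_length=50,      # total length in tokens, prompt included
    do_sample=True,     # sample instead of always taking the top token
    top_p=0.9,          # keep the smallest token set covering 90% probability
    top_k=0,            # turn off top-k filtering so only top-p applies
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token of its own
)

# Convert the generated token IDs back to readable text
print(tokenizer.decode(output[0], skip_special_tokens=True))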

Output:

Explain AI in simple terms:
- Give AI a set value.
- Add it to your AI.
- Add AI to the game.
- Put AI into game to improve AI performance.
- Have

Advantages

The number of available choices adapts to the model’s confidence
Fewer options when the model is confident, leading to safer output
More options when confidence is lower, allowing greater creativity
More flexible than Top-K sampling
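
The adaptive behaviour listed above can be checked with a small comparison. This is a sketch under stated assumptions: the helper `nucleus_size` and both distributions are hypothetical, picked to represent a confident and an uncertain prediction.

```python
def nucleus_size(probs, p):
    """Count how many top-ranked tokens are needed for their probabilities to reach p."""
    cumulative = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += prob
        if cumulative >= p:
            return count
    return len(probs)

# Hypothetical distributions over five candidate words.
confident = [0.92, 0.04, 0.02, 0.01, 0.01]  # one word dominates
uncertain = [0.25, 0.22, 0.20, 0.18, 0.15]  # probability is spread out

k = 3
for name, probs in (("confident", confident), ("uncertain", uncertain)):
    top_k_count = min(k, len(probs))       # Top-K always keeps a fixed number
    top_p_count = nucleus_size(probs, 0.9)  # Top-P adapts to the distribution
    print(name, "| top-k keeps", top_k_count, "| top-p keeps", top_p_count)
```

Top-K with K = 3 keeps three words in both cases, whereas Top-P with P = 0.9 keeps one word for the confident distribution and all five for the uncertain one.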

Temperature vs Top-K vs Top-P

The table below summarizes the key differences in how these sampling methods control creativity and reliability.

Factor	Temperature	Top-K	Top-P (Nucleus)
What it controls	Randomness of output	Number of allowed words	Total probability mass
How it works	Rescales word probabilities	Keeps top K words	Keeps top-ranked words until cumulative probability ≥ P
Main purpose	Balance creativity vs accuracy	Remove unlikely words	Adaptive, confidence-based sampling
Effect on creativity	Higher means more creative	Higher K means more variety	Higher P means more creativity
Typical values	0.2 – 1.5	10 – 100	0.8 – 0.95
Best used when	You want control over randomness	You want strict limits	You want flexible control
