Temperature, Top-K and Top-P Sampling in LLMs

Last Updated : 4 Feb, 2026

Sampling techniques control how a language model chooses the next token (roughly, the next word) during text generation. The model assigns a probability to every candidate token, and the sampling method determines which one is picked. By adjusting these settings, you can balance creativity and accuracy in generated responses.

  • Temperature controls randomness in predictions
  • Top-K limits choices to the most probable tokens
  • Top-P selects tokens based on cumulative probability
  • All three are used to tune output diversity and coherence

Temperature Sampling in LLMs

Temperature controls how random the model’s output is. Typical values range from 0 to 2, and a value of 1 leaves the model’s probabilities unchanged.

  • Low temperature: Safer, more predictable text
  • High temperature: More creative and varied text

Figure: Temperature Sampling. Low temperature leaves one clear choice (car); high temperature spreads probability across many possible choices.

How it works

Before choosing the next word, the model’s raw scores (logits) are divided by the temperature and then converted to probabilities with softmax.

Low temperature:

  • Strongly favours high-probability words
  • Produces stable and predictable text

High temperature:

  • Flattens the probability distribution
  • Allows less likely words to appear more often
  • Increases creativity but may reduce accuracy
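
The effect described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code a real model runs: the function name `softmax_with_temperature` and the four scores in `logits` are made up for the example.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw scores to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum for numerical stability; the result is unchanged.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for four candidate next words (assumed values).
logits = [4.0, 2.0, 1.0, 0.5]

for t in (0.2, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At 0.2 almost all probability lands on the first word, while at 1.5 the remaining words receive a noticeably larger share.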

Example

  • Temperature = 0.2: factual, low creativity
  • Temperature = 1.0: balanced output
  • Temperature = 1.5: creative but less reliable

Implementation

  • Loads a pre-trained GPT-2 tokenizer and language model
  • Takes a text prompt and converts it into tokens
  • Generates text by predicting the next words step by step
  • Applies temperature (0.7) to control creativity and limits output length
  • Converts the output back to readable text and prints it
Python
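from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the pre-trained GPT-2 tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Convert the prompt into token IDs (PyTorch tensors)
inputs = tokenizer("Explain AI in simple terms", return_tensors="pt")

output = model.generate(
    **inputs,
    max_length=50,      # total length in tokens, prompt included
    do_sample=True,     # sample instead of always taking the most likely token
    temperature=0.7,    # values below 1 sharpen the distribution
    top_k=0,            # turn off Top-K, which generate() may otherwise apply by default (top_k=50)
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token
)

# Convert the generated token IDs back to text
print(tokenizer.decode(output[0], skip_special_tokens=True))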

Output (sampled, so it will differ between runs):

Explain AI in simple terms
It's not too hard to learn to use artificial intelligence. The problem is that what you see here is not a single AI system. You'll see that the AI systems are different from the ones we've seen

Advantages

  • Allows control over the balance between accuracy and creativity
  • Low values give more predictable outputs when accuracy is needed
  • High values encourage diverse ideas when creativity is preferred

Top-K Sampling in LLMs

Top-K limits the model to choosing the next word from only the K most likely options, ignoring all other possibilities. This helps control randomness by keeping the selection focused on higher-probability words.

Figure: Top-K Sampling. Only the top K tokens are considered; one is sampled from them.

How it works

  • The model ranks all possible next words by probability
  • Keeps only the top K most likely words
  • Randomly selects one word from this limited set
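
The three steps above can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions: the helper names `top_k_filter` and `sample` and the word probabilities are hypothetical, and a real model works on tens of thousands of tokens rather than five words.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize their probabilities."""
    if k <= 0:
        raise ValueError("k must be positive")
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = ranked[:k]
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

def sample(probs, rng=random):
    """Draw one token according to the given probabilities."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical next-word distribution (assumed values).
probs = {"car": 0.5, "bike": 0.2, "bus": 0.15, "train": 0.1, "banana": 0.05}

filtered = top_k_filter(probs, k=3)
print(filtered)          # only "car", "bike" and "bus" remain
print(sample(filtered))  # one of the three, chosen at random
```

Renormalizing after the cut keeps the relative order of the surviving words while making their probabilities sum to 1 again.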

Example

  • Top-K = 50: selects from the 50 most likely words
  • Smaller K: safer output
  • Larger K: more variety

Implementation

  • Loads a pre-trained GPT-2 tokenizer and model
  • Converts the input text into tokens
  • Generates text by predicting the next words
  • Uses Top-K sampling (K = 50) to limit word choices and reduce unlikely outputs
  • Decodes and prints the generated text
Python
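from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the pre-trained GPT-2 tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Convert the prompt into token IDs (PyTorch tensors)
inputs = tokenizer("Explain AI in simple terms", return_tensors="pt")

output = model.generate(
    **inputs,
    max_length=50,      # total length in tokens, prompt included
    do_sample=True,     # sample instead of always taking the most likely token
    top_k=50,           # sample only from the 50 most likely tokens
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token
)

# Convert the generated token IDs back to text
print(tokenizer.decode(output[0], skip_special_tokens=True))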

Output (sampled, so it will differ between runs):

Explain AI in simple terms.
You may find yourself with the following problems. You have a basic understanding of your current behaviour. If you've tried it out, you'll find that when you go through the troubleshooting, every problem is

Advantages

  • Removes very unlikely words from consideration
  • Reduces strange or incorrect outputs
  • Helps produce cleaner and more reliable text

Top-P (Nucleus) Sampling in LLMs

Top-P selects the next word based on cumulative probability instead of a fixed number of options. The set of possible choices grows or shrinks depending on how confident the model is.

Figure: Top-P Sampling

How it works

  • Words are ranked by their probability
  • Starting from the most likely word, words are added one by one
  • The model stops when the combined probability reaches or exceeds P
  • One word is then randomly selected from this group
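
A minimal sketch of these steps in plain Python is shown below. The function name `top_p_filter` and the word probabilities are assumptions made for illustration; production libraries apply the same idea to the model's full vocabulary.

```python
import random

def top_p_filter(probs, p):
    """Keep the smallest set of top-ranked tokens whose probabilities sum to at least p."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    # Renormalize so the kept probabilities sum to 1.
    return {token: prob / cumulative for token, prob in kept}

# Hypothetical next-word distribution (assumed values).
probs = {"car": 0.5, "bike": 0.2, "bus": 0.15, "train": 0.1, "banana": 0.05}

nucleus = top_p_filter(probs, p=0.9)
print(nucleus)  # "car", "bike", "bus" and "train" remain

# Randomly pick one word from the nucleus, weighted by probability.
tokens = list(nucleus)
print(random.choices(tokens, weights=[nucleus[t] for t in tokens], k=1)[0])
```

With P = 0.9 the first three words only reach 0.85, so a fourth word is added before the cut-off is met.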

Example

  • Top-P = 0.9: selects from the smallest set of most likely words that together account for at least 90% of the total probability
  • Lower P: fewer choices, safer output
  • Higher P: more choices, more creative output

Implementation

  • Loads a pre-trained GPT-2 tokenizer and model
  • Converts the input text into tokens
  • Generates text by predicting the next words
  • Uses Top-P sampling (P = 0.9) to select from words covering 90% probability
  • Decodes and prints the generated text
Python
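from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the pre-trained GPT-2 tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Convert the prompt into token IDs (PyTorch tensors)
inputs = tokenizer("Explain AI in simple terms", return_tensors="pt")

output = model.generate(
    **inputs,
    max_length=50,      # total length in tokens, prompt included
    do_sample=True,     # sample instead of always taking the most likely token
    top_p=0.9,          # sample from the tokens covering 90% of the probability
    top_k=0,            # turn off Top-K, which generate() may otherwise apply by default (top_k=50)
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token
)

# Convert the generated token IDs back to text
print(tokenizer.decode(output[0], skip_special_tokens=True))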

Output (sampled, so it will differ between runs):

Explain AI in simple terms:

- Give AI a set value.

- Add it to your AI.

- Add AI to the game.

- Put AI into game to improve AI performance.

- Have

Advantages

  • The number of available choices adapts to the model’s confidence
  • Fewer options when the model is confident, leading to safer output
  • More options when confidence is lower, allowing greater creativity
  • More flexible than Top-K sampling, which always keeps the same number of words
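
To illustrate how the candidate set adapts, the sketch below counts how many words Top-P keeps for two distributions. Both distributions and the helper name `nucleus_size` are made up for this example.

```python
def nucleus_size(probs, p):
    """Count how many top-ranked probabilities are needed to reach a total of at least p."""
    cumulative = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += prob
        if cumulative >= p:
            return count
    return len(probs)

confident = [0.92, 0.04, 0.02, 0.01, 0.01]  # the model is nearly sure (assumed values)
uncertain = [0.25, 0.22, 0.20, 0.18, 0.15]  # the model is unsure (assumed values)

print(nucleus_size(confident, 0.9))  # 1
print(nucleus_size(uncertain, 0.9))  # 5
```

A fixed Top-K of 3 would keep three words in both cases, admitting two unlikely words in the first case and discarding two plausible ones in the second.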

Temperature vs Top-K vs Top-P

The key differences in how these sampling methods control creativity and reliability are:

| Factor | Temperature | Top-K | Top-P (Nucleus) |
|---|---|---|---|
| What it controls | Randomness of output | Number of allowed words | Total probability mass |
| How it works | Rescales word probabilities | Keeps top K words | Keeps words until probability ≥ P |
| Main purpose | Balance creativity vs accuracy | Remove unlikely words | Adaptive, confidence-based sampling |
| Effect on creativity | Higher means more creative | Higher K means more variety | Higher P means more creativity |
| Typical values | 0.2 – 1.5 | 10 – 100 | 0.8 – 0.95 |
| Best used when | You want control over randomness | You want strict limits | You want flexible control |
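
The three controls act at different points of the same sampling step, so they can also be used together. The sketch below shows one possible ordering (temperature, then Top-K, then Top-P). It is an illustration under assumptions: the function `sample_next`, its parameter conventions (`top_k=0` meaning no limit) and the scores are hypothetical, and libraries may order or implement the steps differently.

```python
import math
import random

def sample_next(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """Sample one token from a {token: logit} mapping using all three controls."""
    # 1. Temperature: rescale the logits, then convert them to probabilities.
    peak = max(logits.values())
    weights = {t: math.exp((x - peak) / temperature) for t, x in logits.items()}
    total = sum(weights.values())
    ranked = sorted(((t, w / total) for t, w in weights.items()),
                    key=lambda item: item[1], reverse=True)
    # 2. Top-K: keep at most top_k candidates (0 means no limit).
    if top_k > 0:
        ranked = ranked[:top_k]
    # 3. Top-P: keep the shortest prefix whose share of the remaining mass reaches top_p.
    mass = sum(p for _, p in ranked)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob / mass
        if cumulative >= top_p:
            break
    tokens = [t for t, _ in kept]
    return rng.choices(tokens, weights=[p for _, p in kept], k=1)[0]

# Hypothetical scores for four candidate next words (assumed values).
logits = {"car": 4.0, "bike": 2.0, "bus": 1.0, "banana": 0.5}

print(sample_next(logits, temperature=0.7, top_k=3, top_p=0.9))
```

Temperature changes the shape of the distribution, while Top-K and Top-P only decide which candidates remain eligible.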

