EvalPlus

Last Updated : 13 Feb, 2026

EvalPlus is a framework for evaluating the real coding ability of LLMs using large, high-quality auto-generated test cases. It goes beyond basic benchmarks to assess code correctness, robustness and real-world reliability.

Many models pass simple tests but fail on harder or unseen cases. EvalPlus uncovers these weaknesses by verifying the code across diverse and challenging inputs.

  • Generates many high-quality, diverse test cases automatically
  • Ensures tests match the problem requirements
  • Detects brittle or partially correct solutions
  • Measures functional correctness rather than rewarding memorized patterns

EvalPlus Workflow

EvalPlus follows a systematic, multi-stage workflow to rigorously evaluate the functional correctness and generalization ability of LLM-generated code.

[Figure: EvalPlus Workflow]

1. Original Dataset (Seed Tests)

EvalPlus starts with original benchmark datasets such as MBPP and HumanEval. These datasets contain:

  • Programming problems
  • Reference (ground-truth) solutions
  • A limited set of manually written test cases

[Figure: Original Dataset]

These original test cases act as seed inputs, providing a reliable baseline that is nevertheless too small to establish correctness on its own.
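
To make "reliable but insufficient" concrete, here is a minimal sketch of a HumanEval-style task. The task, the test inputs and the function names are hypothetical illustrations rather than entries copied from the actual datasets.

```python
# Hypothetical task: return the sorted list of elements common to both inputs.

def reference_solution(a, b):
    """Ground-truth solution: sorted, de-duplicated common elements."""
    return sorted(set(a) & set(b))


def candidate_solution(a, b):
    """Plausible but buggy solution: forgets to de-duplicate."""
    return sorted(x for x in a if x in b)


# A small, manually written seed test suite (inputs only).
seed_inputs = [
    ([1, 2, 3], [2, 3, 4]),
    ([5, 6], [7, 8]),
    ([], [1]),
]

# The buggy candidate agrees with the reference on every seed input...
passes_seed_tests = all(
    candidate_solution(a, b) == reference_solution(a, b) for a, b in seed_inputs
)

# ...but one extra input containing duplicates exposes the bug.
extra_input = ([1, 1, 2], [1, 2])
passes_extra_test = (
    candidate_solution(*extra_input) == reference_solution(*extra_input)
)

print(passes_seed_tests, passes_extra_test)  # prints: True False
```

The candidate looks correct under the seed tests alone, which is the gap the later stages are meant to close.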

2. Seed Input Collection

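The original test inputs are collected as seed inputs. In the EvalPlus paper, ChatGPT is additionally prompted with the ground-truth solution and example inputs to produce further high-quality seeds. Seed inputs represent valid examples that satisfy the problem constraints and serve as the foundation for generating more challenging test cases.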

3. Type-Aware Mutation (Input Generation)

EvalPlus then expands the seed inputs through type-aware mutation, a programmatic step that rewrites each input according to its data type. Unlike random fuzzing, this mutation:

  • Understands input data types
  • Preserves semantic validity
  • Introduces edge cases, boundary values and rare scenarios
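
The idea of type-aware mutation can be sketched as follows. This is a simplified illustration, not EvalPlus's actual implementation: the `mutate` function and the specific mutation rules are assumptions chosen to show how each value is changed while keeping its type.

```python
import random


def mutate(value, rng):
    """Return a mutated copy of `value` that keeps the same Python type."""
    if isinstance(value, bool):  # test bool first: bool is a subclass of int
        return not value
    if isinstance(value, int):
        return value + rng.choice([-1, 1])  # move to a neighbouring value
    if isinstance(value, float):
        return value + rng.uniform(-1.0, 1.0)
    if isinstance(value, str):
        if value and rng.random() < 0.5:
            i = rng.randrange(len(value))
            return value[:i] + value[i + 1:]  # drop one character
        return value + rng.choice("ab ")  # or append one
    if isinstance(value, list):
        items = list(value)
        if not items:
            return items  # element type unknown, so leave empty lists alone
        i = rng.randrange(len(items))
        action = rng.choice(["remove", "repeat", "replace"])
        if action == "remove":
            items.pop(i)
        elif action == "repeat":
            items.insert(i, items[i])
        else:
            items[i] = mutate(items[i], rng)  # recurse into the element
        return items
    if isinstance(value, tuple):
        return tuple(mutate(v, rng) for v in value)
    return value  # unsupported types pass through unchanged


rng = random.Random(0)  # fixed seed so the run is repeatable
seed_input = ([1, 2, 3], "abc")
mutants = [mutate(seed_input, rng) for _ in range(5)]
for m in mutants:
    print(m)
```

Every mutant is still a (list of int, str) pair, so it remains a structurally valid argument tuple for the same function.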

4. New Input Pool Creation

All valid mutated inputs are collected into a new input pool which includes:

  • Original seed inputs
  • Newly generated type-correct mutated inputs

This pool significantly increases test diversity while maintaining correctness and relevance.
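
A minimal sketch of pool construction is shown below, assuming a per-task validity check. The names `build_input_pool` and `non_empty_int_list` are hypothetical, and the precondition is an invented example of a problem constraint.

```python
def build_input_pool(seed_inputs, mutated_inputs, is_valid):
    """Merge seeds and valid mutants, dropping duplicates but keeping order."""
    pool, seen = [], set()
    for args in list(seed_inputs) + list(mutated_inputs):
        key = repr(args)  # repr works as a key even for unhashable lists
        if key in seen or not is_valid(*args):
            continue
        seen.add(key)
        pool.append(args)
    return pool


# Hypothetical precondition for a task that expects a non-empty list of ints.
def non_empty_int_list(xs):
    return isinstance(xs, list) and len(xs) > 0 and all(type(x) is int for x in xs)


seeds = [([1, 2, 3],), ([4],)]
mutants = [([1, 2, 4],), ([],), ([4],), ([1, 2, 3, 3],)]
pool = build_input_pool(seeds, mutants, non_empty_int_list)
print(pool)
```

Here the empty list is rejected by the precondition and the repeated `[4]` is dropped as a duplicate, leaving four inputs in the pool.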

5. Automatic Test Expansion (EvalPlus Dataset)

[Figure: EvalPlus Dataset]

EvalPlus merges seed and mutated inputs to form large-scale test suites resulting in:

  • MBPP+
  • HumanEval+

These expanded benchmarks are collectively referred to as the EvalPlus Dataset, providing much stronger coverage than the original datasets.

6. Differential Testing

EvalPlus applies differential testing to evaluate LLM-generated solutions.

For each test input x:

  • The ground-truth solution produces gt(x)
  • The LLM-generated solution produces f(x)

A test passes if the two outputs match:

f(x) == gt(x)
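
The comparison can be sketched in a few lines. The helper `differential_test` and the two toy functions are hypothetical; a production harness would typically also isolate execution and enforce time limits, which this sketch omits.

```python
def differential_test(candidate, ground_truth, inputs):
    """Return the inputs on which `candidate` disagrees with `ground_truth`."""
    failures = []
    for args in inputs:
        expected = ground_truth(*args)
        try:
            actual = candidate(*args)
        except Exception as exc:  # a crash counts as a failed test
            failures.append((args, repr(exc)))
            continue
        if actual != expected:
            failures.append((args, actual))
    return failures


def gt(xs):
    """Ground truth: largest element of a non-empty list."""
    return max(xs)


def f(xs):
    """Candidate with a bug: assumes at least one non-negative element."""
    best = 0
    for x in xs:
        if x > best:
            best = x
    return best


inputs = [([1, 5, 3],), ([0],), ([-2, -7],)]
failures = differential_test(f, gt, inputs)
print(failures)  # prints: [(([-2, -7],), 0)]
```

Only the all-negative input separates `f` from `gt`, which is the kind of case an expanded input pool is intended to supply.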

7. LLM Samples

LLM samples refer to code solutions generated by large language models for the given programming tasks. These solutions may appear correct on the original test cases but often fail under expanded testing.

[Figure: LLM Samples]

8. Test Suite Reduction

To keep evaluation efficient, EvalPlus applies test suite reduction, removing redundant test cases while preserving failure-detection capability. This step aims for:

  • Lower evaluation cost
  • Faster benchmarking
  • Little to no loss in rigor

[Figure: Test Suite Reduction]
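
One common way to remove redundant tests while keeping detection power is a greedy set cover, sketched below. This is an illustration under assumptions, not a copy of EvalPlus's reduction code: the function `reduce_test_suite`, the test ids and the bug labels are all hypothetical.

```python
def reduce_test_suite(kills):
    """Greedy set cover over a test-to-detected-faults mapping.

    `kills` maps each test id to the set of faulty solutions it detects.
    Returns a smaller list of tests that still detects every faulty
    solution detected by the full suite.
    """
    remaining = set().union(*kills.values())
    selected = []
    while remaining:
        # Pick the test that detects the most not-yet-covered faults.
        # Sorting the ids makes tie-breaking deterministic.
        best = max(sorted(kills), key=lambda t: len(kills[t] & remaining))
        selected.append(best)
        remaining -= kills[best]
    return selected


kills = {
    "t1": {"bug_a"},
    "t2": {"bug_a", "bug_b"},
    "t3": {"bug_c"},
    "t4": {"bug_b"},
    "t5": set(),
}
reduced = reduce_test_suite(kills)
print(reduced)  # prints: ['t2', 't3']
```

Two of the five tests are enough to detect all three faulty solutions, so the other three can be dropped without weakening this particular suite.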

9. Rigorously Validated LLM Samples

An LLM-generated solution is marked rigorously validated only if it:

  • Passes every test case in the expanded (optionally reduced) test suite
  • Matches the ground-truth output for every input

[Figure: Rigorously Validated LLM Samples]

Code Evaluation in EvalPlus

EvalPlus includes EvalPerf, a module that measures the efficiency of LLM-generated code on Linux using low-level perf metrics. It supports both local and Docker-based execution and can generate samples with the vLLM backend. Because it relies on performance counters, the kernel setting perf_event_paranoid must be relaxed (set to 0 in the commands below).

EvalPerf Setup and Execution

1. Install EvalPerf: Installs EvalPerf with performance tracing and vLLM support.

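Shell
# Install the latest development version from GitHub
pip install --upgrade "evalplus[perf,vllm] @ git+https://github.com/evalplus/evalplus"
# Or, alternatively, install the latest stable release
pip install "evalplus[perf,vllm]" --upgrade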

2. Enable perf: Allows the Linux system to capture low level performance metrics.

Shell
sudo sh -c 'echo 0 > /proc/sys/kernel/perf_event_paranoid'
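
Before launching a run, it can help to confirm that the setting took effect. The following is a small sketch, assuming that a value of 0 or lower is sufficient, as the command above implies; the helper name `perf_is_enabled` is hypothetical and not part of EvalPlus.

```python
from pathlib import Path

PARANOID_PATH = Path("/proc/sys/kernel/perf_event_paranoid")


def perf_is_enabled(path=PARANOID_PATH, max_level=0):
    """Return True if the kernel setting is at or below `max_level`.

    Returns False when the file is missing or unreadable, for example on
    non-Linux systems.
    """
    try:
        level = int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return False
    return level <= max_level


print("perf ready:", perf_is_enabled())
```

The setting written with `echo` does not persist across reboots, so a check like this is most useful at the start of each evaluation session.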

3. Run EvalPerf (vLLM backend): Executes performance evaluation using the Magicoder model with fast vLLM inference.

Shell
evalplus.evalperf --model "ise-uiuc/Magicoder-S-DS-6.7B" --backend vllm

Safe Code Execution (Docker)

EvalPlus supports fully isolated code execution using Docker for safer evaluation.

1. Local Generation: Generates code samples locally using the vLLM backend for EvalPerf evaluation.

Shell
evalplus.codegen --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                 --dataset evalperf \
                 --backend vllm \
                 --temperature 1.0 \
                 --n-samples 100

2. Docker Execution: Executes EvalPerf securely inside a Docker container with hardware performance monitoring enabled.

Shell
sudo sh -c 'echo 0 > /proc/sys/kernel/perf_event_paranoid'

docker run --cap-add PERFMON --rm --pull=always \
  -v $(pwd)/evalplus_results:/app ganler/evalplus:latest \
  evalplus.evalperf --samples /app/evalperf/ise-uiuc--Magicoder-S-DS-6.7B_vllm_temp_1.0.jsonl

Supported LLM Backends

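EvalPlus supports multiple backends, enabling flexible evaluation across local, high-performance and API-based LLMs. In the commands below, bracketed options such as [humaneval|mbpp] are placeholders: pass exactly one of the listed values.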

1. HuggingFace (Transformers Backend): Runs evaluation using HuggingFace transformer models locally.

Shell
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --dataset [humaneval|mbpp] \
                  --backend hf \
                  --greedy

Enable Flash Attention 2 (FA2): Improves attention speed and memory efficiency.

Shell
pip install packaging ninja
pip install flash-attn --no-build-isolation
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --dataset [humaneval|mbpp] \
                  --backend hf \
                  --attn-implementation [flash_attention_2|sdpa] \
                  --greedy

2. OpenAI Compatible Backends: Evaluates models served through the OpenAI API. The example below uses an official OpenAI-hosted model.

Shell
export OPENAI_API_KEY="{KEY}"
evalplus.evaluate --model "gpt-4o-2024-08-06" \
                  --dataset [humaneval|mbpp] \
                  --backend openai --greedy
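
When scripting several runs, the bracketed placeholders have to be resolved to one concrete value each. The sketch below builds the argument list from only the flags shown in the commands above; the helper `build_evaluate_command` is hypothetical and is not part of EvalPlus, and it only constructs the command without running it.

```python
import shlex

# Values taken from the commands shown above.
DATASETS = {"humaneval", "mbpp"}
BACKENDS = {"hf", "openai"}


def build_evaluate_command(model, dataset, backend, greedy=True):
    """Build an `evalplus.evaluate` argument list with validated choices."""
    if dataset not in DATASETS:
        raise ValueError(f"dataset must be one of {sorted(DATASETS)}")
    if backend not in BACKENDS:
        raise ValueError(f"backend must be one of {sorted(BACKENDS)}")
    cmd = [
        "evalplus.evaluate",
        "--model", model,
        "--dataset", dataset,
        "--backend", backend,
    ]
    if greedy:
        cmd.append("--greedy")
    return cmd


cmd = build_evaluate_command("ise-uiuc/Magicoder-S-DS-6.7B", "mbpp", "hf")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where EvalPlus is installed.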

EvalPlus Datasets & Benchmarks

EvalPlus strengthens two popular coding benchmarks, MBPP and HumanEval, into more rigorous versions called MBPP+ and HumanEval+ by adding extensive new test cases.

  • MBPP+: An enhanced version of MBPP with roughly 35× more test cases, verifying code behavior across normal, edge and tricky inputs.
  • HumanEval+: An upgraded HumanEval benchmark with 80× more test cases to evaluate code robustness against corner and failure-prone cases.

Advantages

  • Provides highly accurate evaluation using large and diverse test suites.
  • Automatically generates test cases without manual effort.
  • Ensures fair and reproducible model comparison.
  • Detects hidden bugs and edge-case failures.
  • Supports local and Docker-based execution, with HuggingFace, vLLM and OpenAI-compatible backends.
  • Evaluates both correctness and execution speed via EvalPerf.

Limitations

  • Works only for Python-based coding tasks.
  • Requires correct ground-truth reference solutions.
  • Large test sets increase evaluation runtime.
  • Does not assess code quality or readability.