EvalPlus

EvalPlus is a framework for evaluating the coding ability of LLMs using large numbers of high-quality, automatically generated test cases. It goes beyond the original benchmarks to assess code correctness, robustness and real-world reliability.

Many models pass simple tests but fail on harder or unseen cases. EvalPlus uncovers these weaknesses by running the generated code on diverse and challenging inputs. In particular, it:

Generates many high-quality, diverse test cases automatically
Ensures tests match the problem requirements
Detects brittle or partially correct solutions
Measures functional correctness rather than rewarding memorized patterns

EvalPlus Workflow

EvalPlus follows a systematic, multi-stage workflow to rigorously evaluate the functional correctness and generalization ability of LLM-generated code.

1. Original Dataset (Seed Tests)

EvalPlus starts with original benchmark datasets such as MBPP and HumanEval. These datasets contain:

Programming problems
Reference (ground-truth) solutions
A limited set of manually written test cases

These original test cases act as seed inputs. They provide a reliable baseline, but too few of them exist to establish correctness on their own.

2. Seed Input Collection

The original test cases are extracted and used as seed inputs. Seed inputs represent valid, minimal examples that satisfy the problem constraints and serve as the foundation for generating more challenging test cases.

3. Type-Aware Mutation (Input Generation)

EvalPlus uses ChatGPT to produce additional high-quality seed inputs, then expands the seeds into many new test inputs through type-aware mutation. Unlike random fuzzing, type-aware mutation:

Understands input data types
Preserves semantic validity
Introduces edge cases, boundary values and rare scenarios
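To make the idea concrete, here is a minimal sketch of type-aware mutation. The `mutate` function and its mutation rules are illustrative assumptions, not EvalPlus's actual implementation; the point is only that each mutation inspects the type of the seed value and returns a new value of the same type.

```python
import random
from typing import Any


def mutate(value: Any, rng: random.Random) -> Any:
    """Return a mutated copy of `value` that has the same type as the original."""
    # bool is checked before int because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return not value
    if isinstance(value, int):
        return value + rng.choice([-1, 1])
    if isinstance(value, float):
        return value + rng.uniform(-1.0, 1.0)
    if isinstance(value, str):
        if value and rng.random() < 0.5:
            i = rng.randrange(len(value))
            return value[:i] + value[i + 1:]  # delete one character
        return value + rng.choice("ab ")  # append one character
    if isinstance(value, list):
        if not value:
            return []  # nothing to mutate without knowing the element type
        i = rng.randrange(len(value))
        if rng.random() < 0.5:
            # Replace one element with a mutated element of the same type.
            return value[:i] + [mutate(value[i], rng)] + value[i + 1:]
        # Grow the list by appending a mutated copy of an existing element.
        return value + [mutate(value[i], rng)]
    if isinstance(value, tuple):
        return tuple(mutate(list(value), rng))
    if isinstance(value, dict):
        if not value:
            return {}
        key = rng.choice(list(value))
        return {**value, key: mutate(value[key], rng)}
    return value  # unknown types are left unchanged


# Usage: derive several new inputs from one seed input.
rng = random.Random(0)
seed_input = [1, 2, 3]
new_inputs = [mutate(seed_input, rng) for _ in range(5)]
```

Because every branch returns the same type it received, a function that expects a list of integers still receives a list of integers, which is what separates this approach from byte-level random fuzzing.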

4. New Input Pool Creation

All valid mutated inputs are collected into a new input pool which includes:

Original seed inputs
Newly generated type-correct mutated inputs

This pool significantly increases test diversity while maintaining correctness and relevance.
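The pooling step can be sketched as follows. The helper `build_input_pool` and the validity check `is_valid` are hypothetical names introduced for illustration; the sketch assumes each problem comes with some way to decide whether an input satisfies its constraints.

```python
from typing import Any, Callable, Iterable, List


def build_input_pool(
    seeds: Iterable[Any],
    mutated: Iterable[Any],
    is_valid: Callable[[Any], bool],
) -> List[Any]:
    """Merge seed and mutated inputs, dropping duplicates and invalid inputs."""
    pool: List[Any] = []
    seen = set()
    for candidate in [*seeds, *mutated]:
        key = repr(candidate)  # repr gives a hashable key even for lists and dicts
        if key in seen or not is_valid(candidate):
            continue
        seen.add(key)
        pool.append(candidate)
    return pool


# Example problem (assumed): the function takes a non-empty list of integers.
def non_empty_int_list(x: Any) -> bool:
    return isinstance(x, list) and len(x) > 0 and all(isinstance(v, int) for v in x)


pool = build_input_pool(
    seeds=[[1, 2]],
    mutated=[[1, 2], [], [1, 3], ["a"]],
    is_valid=non_empty_int_list,
)
```

Here the duplicate `[1, 2]`, the empty list and the list of strings are rejected, so only inputs that the problem statement allows reach the test suite.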

5. Automatic Test Expansion (EvalPlus Dataset)

EvalPlus merges seed and mutated inputs into large-scale test suites, resulting in:

MBPP+
HumanEval+

These expanded benchmarks are collectively referred to as the EvalPlus Dataset, providing much stronger coverage than the original datasets.

6. Differential Testing

EvalPlus applies differential testing to evaluate LLM-generated solutions.

For each test input x:

The ground-truth solution produces gt(x)
The LLM-generated solution produces f(x)

A test passes if:

f(x) == gt(x)
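The comparison can be sketched in a few lines of Python. The function `differential_test` and the two example solutions are assumptions made for illustration; the sketch runs code in-process and omits the timeouts and sandboxing that a real harness would need.

```python
from typing import Any, Callable, Iterable, List, Tuple


def differential_test(
    candidate: Callable[..., Any],
    ground_truth: Callable[..., Any],
    inputs: Iterable[Tuple[Any, ...]],
) -> List[Tuple[Any, ...]]:
    """Return the inputs x for which candidate(x) differs from ground_truth(x)."""
    failures = []
    for args in inputs:
        expected = ground_truth(*args)
        try:
            actual = candidate(*args)
        except Exception:
            failures.append(args)  # a crash counts as a failed test
            continue
        if actual != expected:
            failures.append(args)
    return failures


def gt_max(xs):
    """Ground-truth solution: largest element of a non-empty list."""
    return max(xs)


def llm_max(xs):
    """Hypothetical LLM solution with a bug: assumes values are non-negative."""
    best = 0
    for x in xs:
        if x > best:
            best = x
    return best


failing = differential_test(llm_max, gt_max, [([1, 2, 3],), ([-5, -2],)])
```

The buggy solution agrees with the ground truth on `[1, 2, 3]` but returns 0 for `[-5, -2]`, so only the second input is reported as a failure. This is the kind of edge case that a small seed test set can miss.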

7. LLM Samples

LLM samples refer to code solutions generated by large language models for the given programming tasks. These solutions may appear correct on the original test cases but often fail under expanded testing.

8. Test Suite Reduction

To keep evaluation efficient, EvalPlus applies test suite reduction, removing redundant test cases while preserving failure-detection capability. This step ensures:

Lower evaluation cost
Faster benchmarking
Little to no loss in failure-detection power
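One common way to reduce a test suite is a greedy set-cover heuristic, sketched below. This is an illustrative technique, not necessarily the procedure EvalPlus uses; the `kills` mapping (test id to the set of known-wrong solutions that the test catches) is an assumed input format.

```python
from typing import Dict, List, Set


def reduce_test_suite(kills: Dict[str, Set[str]]) -> List[str]:
    """Greedily pick tests until every detectable wrong solution is still caught."""
    uncovered: Set[str] = set().union(*kills.values())
    selected: List[str] = []
    while uncovered:
        # Pick the test that catches the most not-yet-covered wrong solutions.
        # Iterating over sorted ids makes tie-breaking deterministic.
        best = max(sorted(kills), key=lambda t: len(kills[t] & uncovered))
        selected.append(best)
        uncovered -= kills[best]
    return selected


kills = {
    "t1": {"wrong_a", "wrong_b"},
    "t2": {"wrong_b"},
    "t3": {"wrong_c"},
    "t4": {"wrong_a", "wrong_b", "wrong_c"},
}
reduced = reduce_test_suite(kills)
```

In this example a single test, `t4`, catches every wrong solution that the full suite catches, so the other three tests can be dropped. Greedy set cover does not guarantee the smallest possible suite, but it is simple and usually close.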

9. Rigorously Validated LLM Samples

An LLM-generated solution is marked rigorously validated only if it:

Passes every test in the reduced, expanded test suite
Matches the ground-truth output for every input


Code Evaluation in EvalPlus

EvalPlus includes EvalPerf, a module that measures the efficiency of LLM-generated code on Linux using low-level perf metrics. It supports both local and Docker-based execution and works with vLLM for fast multi-sample generation. EvalPerf requires the kernel setting perf_event_paranoid to be lowered to 0 so that unprivileged processes can read performance events.

EvalPerf Setup and Execution

1. Install EvalPerf: Installs EvalPlus with the perf and vllm extras. Run one of the two commands below.

Bash

# Latest development version from GitHub
pip install --upgrade "evalplus[perf,vllm] @ git+https://github.com/evalplus/evalplus"
# Or the latest stable release from PyPI
pip install "evalplus[perf,vllm]" --upgrade

2. Enable perf: Lowers the kernel's perf restriction level so that low-level performance metrics can be captured. The setting lasts until the next reboot.

Bash

sudo sh -c 'echo 0 > /proc/sys/kernel/perf_event_paranoid'
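Before launching a long run, it can help to confirm that the setting took effect. The following is a small sketch that reads the same file from Python; the helper names `read_perf_paranoid` and `perf_is_open` are hypothetical and not part of EvalPlus.

```python
from pathlib import Path
from typing import Optional

PARANOID_PATH = "/proc/sys/kernel/perf_event_paranoid"


def read_perf_paranoid(path: str = PARANOID_PATH) -> Optional[int]:
    """Return the current perf_event_paranoid level, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None  # not Linux, file missing, or unexpected contents


def perf_is_open(level: Optional[int]) -> bool:
    """True when the level is 0 or lower, matching the `echo 0` command above."""
    return level is not None and level <= 0


level = read_perf_paranoid()
if not perf_is_open(level):
    print(f"perf_event_paranoid is {level}; lower it to 0 before running EvalPerf")
```

On a machine where the file does not exist, `read_perf_paranoid` returns None rather than raising, so the check is safe to run anywhere.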

3. Run EvalPerf (vLLM backend): Executes performance evaluation using the Magicoder model with fast vLLM inference.

Bash

evalplus.evalperf --model "ise-uiuc/Magicoder-S-DS-6.7B" --backend vllm

Safe Code Execution (Docker)

EvalPlus can run generated code inside a Docker container, which isolates untrusted code from the host system and makes evaluation safer.

1. Local Generation: Generates code samples locally using the vLLM backend for EvalPerf evaluation.

Bash

evalplus.codegen --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                 --dataset evalperf \
                 --backend vllm \
                 --temperature 1.0 \
                 --n-samples 100

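2. Docker Execution: Runs EvalPerf inside a Docker container with hardware performance monitoring enabled. The container reads the samples generated in the previous step from the mounted evalplus_results directory.

Bash

# Allow performance events on the host (the container shares the host kernel)
sudo sh -c 'echo 0 > /proc/sys/kernel/perf_event_paranoid'

# Mount the local results directory at /app and grant the PERFMON capability
docker run --cap-add PERFMON --rm --pull=always \
  -v "$(pwd)/evalplus_results:/app" ganler/evalplus:latest \
  evalplus.evalperf --samples /app/evalperf/ise-uiuc--Magicoder-S-DS-6.7B_vllm_temp_1.0.jsonl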
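The samples file name in the command above appears to be derived from the model name, backend and temperature. The sketch below reproduces that one observed path; the function `guess_samples_path` is hypothetical and the naming rule is inferred from this single example, so check the actual output directory before relying on it.

```python
def guess_samples_path(
    model: str,
    backend: str,
    temperature: float,
    dataset: str = "evalperf",
    root: str = "evalplus_results",
) -> str:
    """Build a samples path following the pattern seen in the Docker command."""
    # Assumption: "/" in the model name is replaced with "--".
    safe_model = model.replace("/", "--")
    return f"{root}/{dataset}/{safe_model}_{backend}_temp_{temperature}.jsonl"


path = guess_samples_path("ise-uiuc/Magicoder-S-DS-6.7B", "vllm", 1.0)
print(path)
```

Inside the container the same file is visible under /app instead of evalplus_results, because of the volume mount in the docker run command.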

Supported LLM Backends

EvalPlus supports multiple backends, enabling flexible evaluation across local, high-performance and API-based LLMs.

1. HuggingFace (Transformers Backend): Runs evaluation using HuggingFace transformer models locally.

Bash

# [humaneval|mbpp] is a placeholder: pass either humaneval or mbpp
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --dataset [humaneval|mbpp] \
                  --backend hf \
                  --greedy

Enable Flash Attention 2 (FA2): Improves attention speed and memory efficiency with the HuggingFace backend.

Bash

# Build dependencies for flash-attn
pip install packaging ninja
pip install flash-attn --no-build-isolation
# Bracketed values are placeholders: pick one option for each flag
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --dataset [humaneval|mbpp] \
                  --backend hf \
                  --attn-implementation [flash_attention_2|sdpa] \
                  --greedy

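2. OpenAI Compatible Backends: Evaluates models served through an OpenAI-style API. The example below targets an OpenAI-hosted model and reads the API key from the OPENAI_API_KEY environment variable.

Bash

# Replace {KEY} with your own API key
export OPENAI_API_KEY="{KEY}"
evalplus.evaluate --model "gpt-4o-2024-08-06" \
                  --dataset [humaneval|mbpp] \
                  --backend openai --greedy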

EvalPlus Datasets & Benchmarks

EvalPlus strengthens two popular coding benchmarks, MBPP and HumanEval, into more rigorous versions called MBPP+ and HumanEval+ by adding extensive new test cases.

MBPP+: An enhanced version of MBPP with 35× more test cases to verify code behavior across normal, edge and tricky inputs.
HumanEval+: An upgraded HumanEval benchmark with 80× more test cases to evaluate code robustness against corner and failure-prone cases.

Advantages

Provides highly accurate evaluation using large and diverse test suites.
Automatically generates test cases without manual effort.
Ensures fair and reproducible model comparison.
Detects hidden bugs and edge-case failures.
Runs locally or in Docker, with HuggingFace, vLLM and OpenAI-compatible backends.
Evaluates both correctness and execution speed via EvalPerf.

Limitations

Works only for Python-based coding tasks.
Requires correct ground-truth reference solutions.
Large test sets increase evaluation runtime.
Does not assess code quality or readability.
