This directory contains scripts for evaluating how the finetune endpoint affects performance on a project and general coding benchmarks. Note that the eval was a WIP when this project was archived. The basic eval script is functional, but the RAG eval script is not feature complete.
To run an eval, you first need to add your own data as described in the Evaluation Algorithm section below. The data should mimic the types of projects you might want to finetune on and get suggestions for.
You can run the evaluation by navigating to the eval directory and running the following command:

```
python run_eval.py
```
You can evaluate RAG-specific performance by running the following command:

```
python run_rag_eval.py
```
The purpose of the RAG eval is to test how much RAG improves performance on a project. Note that RAG with generic pretrained models did not help in our experiments, and the feature implementation was incomplete. We do not recommend using RAG unless you plan to build on the existing implementation.
The purpose of the main eval is to:

- Test how much finetuning on relevant materials improves performance on a project
- Test how much finetuning on irrelevant projects hurts performance on a project
- Test how much finetuning on arbitrary materials hurts performance on general coding benchmarks
Assume data in `data/` with the following directory structure:

```
data/{project_name}/
    metadata.json
    train/            # This is the code that will be fine-tuned on
        code/
            {code_file_and_dirs}
        documents/
            {doc_file_and_dirs}    (currently unused; in the future could support documents like PDFs)
        links.txt                  (each link should be separated by a newline)
    test/             # This is the code for which we will generate suggestions
        {test_code_file_and_dirs}
```
Our official eval dataset is not included in the repository; it can be downloaded separately at this Google Drive link.
The evaluation then proceeds as follows (a sketch of this loop is included after the list):

- Evaluate on generic coding benchmarks with a fresh model.
- Load each of the project directories and corresponding test directories.
- For each project:
  - Reinitialize a new model.
  - Evaluate on the test data.
  - Train on the training data with the finetune endpoint.
  - Evaluate on the test data again.
  - If specified in `metadata.json`, evaluate on generic coding benchmarks.
- When evaluating, log the following metrics:
  - Loss on training data
  - Loss on test data
  - Change in loss on training data
  - Change in loss on test data
  - Change in loss on generic coding benchmarks
- After evaluating on all projects, collate the results to log:
  - Averages of the above metrics
  - Change in test loss averaged by project type (related, unrelated)
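For orientation, the loop above might be sketched roughly as follows. The helper functions (`init_model`, `evaluate_loss`, `finetune`, `run_generic_benchmarks`), the `DATA_DIR` constant, and the metric names are hypothetical placeholders for whatever `run_eval.py` actually calls, not its real API:

```python
import json
from pathlib import Path
from statistics import mean

DATA_DIR = Path("data")  # hypothetical location; see the directory structure above


def init_model():
    """Hypothetical: return a fresh (not yet finetuned) model handle."""
    raise NotImplementedError


def evaluate_loss(model, code_dir: Path) -> float:
    """Hypothetical: return the average loss of `model` on the code under `code_dir`."""
    raise NotImplementedError


def finetune(model, train_dir: Path) -> None:
    """Hypothetical: send the training code to the finetune endpoint."""
    raise NotImplementedError


def run_generic_benchmarks(model, benchmarks) -> float:
    """Hypothetical: return a loss on the requested generic coding benchmarks."""
    raise NotImplementedError


def run_eval():
    # Baseline on generic coding benchmarks with a fresh model.
    baseline_bench = run_generic_benchmarks(init_model(), ["HumanEval"])

    results = []
    # Load each project directory (each contains metadata.json, train/, test/).
    for project_dir in sorted(p for p in DATA_DIR.iterdir() if p.is_dir()):
        metadata = json.loads((project_dir / "metadata.json").read_text())

        # Reinitialize a new model per project; measure losses before and
        # after hitting the finetune endpoint.
        model = init_model()
        train_loss_before = evaluate_loss(model, project_dir / "train" / "code")
        test_loss_before = evaluate_loss(model, project_dir / "test")

        finetune(model, project_dir / "train")

        train_loss_after = evaluate_loss(model, project_dir / "train" / "code")
        test_loss_after = evaluate_loss(model, project_dir / "test")

        record = {
            "project": project_dir.name,
            "test_type": metadata["test_type"],  # "related" or "unrelated"
            "train_loss": train_loss_after,
            "test_loss": test_loss_after,
            "train_loss_delta": train_loss_after - train_loss_before,
            "test_loss_delta": test_loss_after - test_loss_before,
        }
        # Only re-run the generic benchmarks if metadata.json asks for it.
        if metadata.get("generic_benchmarks"):
            bench = run_generic_benchmarks(model, metadata["generic_benchmarks"])
            record["benchmark_delta"] = bench - baseline_bench
        results.append(record)

    # Collate: averages across projects, plus change in test loss by project type.
    if results:
        print("avg test loss delta:", mean(r["test_loss_delta"] for r in results))
        for test_type in ("related", "unrelated"):
            deltas = [r["test_loss_delta"] for r in results if r["test_type"] == test_type]
            if deltas:
                print(f"avg test loss delta ({test_type}):", mean(deltas))
    return results


if __name__ == "__main__":
    run_eval()
```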
Each project directory should contain a `metadata.json` file with the following fields (an example file is shown after the list):

- `test_type`: one of `related`, `unrelated`
- `generic_benchmarks`: a list of generic benchmarks to evaluate on; valid values: {`HumanEval`, `all`}
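For instance, a `metadata.json` for a related project that also evaluates on HumanEval could look like this (the values are illustrative, not taken from the official dataset):

```json
{
    "test_type": "related",
    "generic_benchmarks": ["HumanEval"]
}
```

If `generic_benchmarks` is omitted or empty, the generic-benchmark step is skipped for that project, per the algorithm above.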