This directory contains scripts for evaluating how the finetune endpoint affects performance on a project and general coding benchmarks. Note that the eval was a WIP when this project was archived. The basic eval script is functional, but the RAG eval script is not feature complete.
To run an eval, you first need to add your own data as described in the Evaluation Algorithm section below. The data should mimic the types of projects you might want to finetune on and get suggestions for.
You can run the evaluation by navigating to the eval directory and running the following command:

```
python run_eval.py
```
You can evaluate RAG-specific performance by running the following command:

```
python run_rag_eval.py
```
The purpose of the RAG eval is to test how much RAG improves performance on a project. Note that RAG with generic pretrained models did not help in our experiments, and the feature implementation was incomplete. We do not recommend using RAG unless you plan to build on the existing implementation.
The purpose of the main eval is to:

- Test how much finetuning on relevant materials improves performance on a project
- Test how much finetuning on irrelevant projects hurts performance on a project
- Test how much finetuning on arbitrary materials hurts performance on general coding benchmarks
Assume data in `data/` with the following directory structure:

```
data/{project_name}/
    metadata.json
    train/            # This is the code that will be fine-tuned on
        code/
            {code_file_and_dirs}
        documents/
            {doc_file_and_dirs}    (currently unused; in the future could support documents like PDFs)
        links.txt                  (each link should be separated by a newline)
    test/             # This is the code for which we will generate suggestions
        {test_code_file_and_dirs}
```
Our official eval dataset is not included in the repository; it can be downloaded separately at this Google Drive link.
The evaluation then proceeds as follows (a sketch of this loop is included after the list):

- Evaluate on generic coding benchmarks with a fresh model.
- Load each of the project directories and corresponding test directories.
- For each project:
  - Reinitialize a new model.
  - Evaluate on the test data.
  - Train on the training data with the finetune endpoint.
  - Evaluate on the test data again.
  - If specified in `metadata.json`, evaluate on generic coding benchmarks.
- When evaluating, log the following metrics:
  - Loss on training data
  - Loss on test data
  - Change in loss on training data
  - Change in loss on test data
  - Change in loss on generic coding benchmarks
- After evaluating on all projects, collate the results to log:
  - Averages of the above metrics
  - Change in test loss averaged by project type (related, unrelated)
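For orientation, the loop above might be sketched roughly as follows. The helper functions (`init_model`, `evaluate_loss`, `finetune`, `run_generic_benchmarks`), the `DATA_DIR` constant, and the metric names are hypothetical placeholders for whatever `run_eval.py` actually calls, not its real API:

```python
import json
from pathlib import Path
from statistics import mean

DATA_DIR = Path("data")  # hypothetical location; see the directory structure above


def init_model():
    """Hypothetical: return a fresh (not yet finetuned) model handle."""
    raise NotImplementedError


def evaluate_loss(model, code_dir: Path) -> float:
    """Hypothetical: return the average loss of `model` on the code under `code_dir`."""
    raise NotImplementedError


def finetune(model, train_dir: Path) -> None:
    """Hypothetical: send the training code to the finetune endpoint."""
    raise NotImplementedError


def run_generic_benchmarks(model, benchmarks) -> float:
    """Hypothetical: return a loss on the requested generic coding benchmarks."""
    raise NotImplementedError


def run_eval():
    # Baseline on generic coding benchmarks with a fresh model.
    baseline_bench = run_generic_benchmarks(init_model(), ["HumanEval"])

    results = []
    # Load each project directory (each contains metadata.json, train/, test/).
    for project_dir in sorted(p for p in DATA_DIR.iterdir() if p.is_dir()):
        metadata = json.loads((project_dir / "metadata.json").read_text())

        # Reinitialize a new model per project; measure losses before and
        # after hitting the finetune endpoint.
        model = init_model()
        train_loss_before = evaluate_loss(model, project_dir / "train" / "code")
        test_loss_before = evaluate_loss(model, project_dir / "test")

        finetune(model, project_dir / "train")

        train_loss_after = evaluate_loss(model, project_dir / "train" / "code")
        test_loss_after = evaluate_loss(model, project_dir / "test")

        record = {
            "project": project_dir.name,
            "test_type": metadata["test_type"],  # "related" or "unrelated"
            "train_loss": train_loss_after,
            "test_loss": test_loss_after,
            "train_loss_delta": train_loss_after - train_loss_before,
            "test_loss_delta": test_loss_after - test_loss_before,
        }
        # Only re-run the generic benchmarks if metadata.json asks for it.
        if metadata.get("generic_benchmarks"):
            bench = run_generic_benchmarks(model, metadata["generic_benchmarks"])
            record["benchmark_delta"] = bench - baseline_bench
        results.append(record)

    # Collate: averages across projects, plus change in test loss by project type.
    if results:
        print("avg test loss delta:", mean(r["test_loss_delta"] for r in results))
        for test_type in ("related", "unrelated"):
            deltas = [r["test_loss_delta"] for r in results if r["test_type"] == test_type]
            if deltas:
                print(f"avg test loss delta ({test_type}):", mean(deltas))
    return results


if __name__ == "__main__":
    run_eval()
```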
Each project directory should contain a `metadata.json` file with the following fields (an example file is shown after the list):

- `test_type`: one of `related`, `unrelated`
- `generic_benchmarks`: a list of generic benchmarks to evaluate on; valid values: {`HumanEval`, `all`}
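For instance, a `metadata.json` for a related project that also evaluates on HumanEval could look like this (the values are illustrative, not taken from the official dataset):

```json
{
    "test_type": "related",
    "generic_benchmarks": ["HumanEval"]
}
```

If `generic_benchmarks` is omitted or empty, the generic-benchmark step is skipped for that project, per the algorithm above.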