diff --git a/Contest/ExampleSimAnalysis/README.md b/Contest/ExampleSimAnalysis/README.md
new file mode 100644
index 0000000..2138e48
--- /dev/null
+++ b/Contest/ExampleSimAnalysis/README.md
@@ -0,0 +1,89 @@
+# Example of contest test set evaluation
+
+Author: [Vladimir Zolotov](mailto:zolotov@us.ibm.com)
+
+
+This directory holds an example program, `TestSetEval.py`, and a script, `test_eval.sh`, for evaluating a contest test set on source code similarity.
+
+The program is given only as an example. Contestants are expected to write their own test set evaluation program, better suited to their operational environment, file formats, and ML framework.
+
+The program `TestSetEval.py` accepts a test set consisting of two components:
+
+1. A directory of C++ source files to be checked against each other for similarity or dissimilarity.
+2. A CSV file describing the test set in the following format:
+
+`<pair-id>,<path to 1-st file>,<path to 2-nd file>`
+
+where `path to 1-st file` and `path to 2-nd file` are the paths to the pair of files to be checked for similarity or dissimilarity. The paths are relative to the directory with the source code files. Here is an example of a test set CSV file:
+
+```
+pair-id,file1,file2
+1,p02761/s682789980.cpp,p02761/s060579067.cpp
+2,p02817/s224840938.cpp,p03360/s804824036.cpp
+3,p01085/s591760635.cpp,p01085/s734466918.cpp
+```
+
+The program `TestSetEval.py` can also accept a CSV file of ground truth labels in the following format:
+`,