# Benchmarking Configs Specification

## Config Structure

Benchmark config files are written in JSON format and have a few reserved keys:
 - `INCLUDE` - Other configuration files whose parameter sets should be included
 - `PARAMETERS_SETS` - Named sets of benchmark parameters
 - `TEMPLATES` - Benchmark setups, each combining parameter sets with template-specific parameters
 - `SETS` - Parameter sets to include in a template

Configs rely heavily on lists of scalar values and dictionaries to avoid duplicating cases.

Formatting specification:
```json
{
    "INCLUDE": [
        "another_config_file_path_0"
        ...
    ],
    "PARAMETERS_SETS": {
        "parameters_set_name_0": Dict or List[Dict] of any JSON-serializable with any level of nesting,
        ...
    },
    "TEMPLATES": {
        "template_name_0": {
            "SETS": ["parameters_set_name_0", ...],
            Dict of any JSON-serializable with any level of nesting overwriting parameter sets
        },
        ...
    }
}
```

Example:
```json
{
    "PARAMETERS_SETS": {
        "estimator parameters": {
            "algorithm": {
                "estimator": "LinearRegression",
                "estimator_params": {
                    "fit_intercept": false
                }
            }
        },
        "regression data": {
            "data": [
                { "source": "fetch_openml", "id": 1430 },
                { "dataset": "california_housing" }
            ]
        }
    },
    "TEMPLATES": {
        "linear regression": {
            "SETS": ["estimator parameters", "regression data"],
            "algorithm": {
                "library": ["sklearn", "sklearnex", "cuml"]
            }
        }
    }
}
```
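
The `INCLUDE` key lets a config reuse parameter sets defined in other files. As a sketch based on the example above (the included file path is hypothetical), the `"regression data"` set could live in a shared config and still be referenced from the template:
```json
{
    "INCLUDE": ["common/regression_data_config.json"],
    "PARAMETERS_SETS": {
        "estimator parameters": {
            "algorithm": {
                "estimator": "LinearRegression",
                "estimator_params": { "fit_intercept": false }
            }
        }
    },
    "TEMPLATES": {
        "linear regression": {
            "SETS": ["estimator parameters", "regression data"],
            "algorithm": { "library": ["sklearn", "sklearnex", "cuml"] }
        }
    }
}
```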

## Common Parameters

Configs have three top-level parameter keys:
 - `bench` - Specifies the benchmark workflow, such as measurement or profiling parameters
 - `algorithm` - Specifies parameters of the measured entity
 - `data` - Specifies the data to use and its parameters

| Parameter keys | Default value | Choices | Description |
|:---------------|:--------------|:--------|:------------|
|<h3>Benchmark workflow parameters</h3>||||
| `bench`:`taskset` | None | | Value for the `-c` argument of the `taskset` utility applied to the benchmark subcommand. |
| `bench`:`vtune_profiling` | None | | Analysis type for the `collect` argument of the Intel(R) VTune* Profiler tool. Linux* OS only. |
| `bench`:`vtune_results_directory` | `_vtune_results` | | Directory path to store Intel(R) VTune* Profiler results. |
| `bench`:`n_runs` | `10` | | Number of runs for the measured entity. |
| `bench`:`time_limit` | `3600` | | Time limit in seconds before the benchmark stops early. |
| `bench`:`memory_profile` | False | | Profiles memory usage of the benchmark process. |
| `bench`:`flush_cache` | False | | Flushes the cache before every time measurement if enabled. |
| `bench`:`cpu_profile` | False | | Profiles average CPU load during the benchmark run. |
| `bench`:`distributor` | None | None, `mpi` | Library used to handle the distributed algorithm. |
| `bench`:`mpi_params` | Empty `dict` | | Parameters for the `mpirun` command of the MPI library. |
|<h3>Data parameters</h3>||||
| `data`:`cache_directory` | `data_cache` | | Directory path to store cached datasets for fast loading. |
| `data`:`raw_cache_directory` | `data`:`cache_directory` + "raw" | | Directory path to store downloaded raw datasets. |
| `data`:`dataset` | None | | Name of the dataset to use from the implemented dataset loaders. |
| `data`:`source` | None | `fetch_openml`, `make_regression`, `make_classification`, `make_blobs` | Data source to use for loading or synthetic generation. |
| `data`:`id` | None | | OpenML data id for the `fetch_openml` source. |
| `data`:`preprocessing_kwargs`:`replace_nan` | `median` | `median`, `mean` | Value to replace NaNs with in preprocessed data. |
| `data`:`preprocessing_kwargs`:`category_encoding` | `ordinal` | `ordinal`, `onehot`, `drop`, `ignore` | How to encode categorical features in preprocessed data. |
| `data`:`preprocessing_kwargs`:`normalize` | False | | Enables normalization of preprocessed data. |
| `data`:`preprocessing_kwargs`:`force_for_sparse` | True | | Forces preprocessing for sparse data formats. |
| `data`:`split_kwargs` | Empty `dict` or default split from dataset description | | Data split parameters for the `train_test_split` function. |
| `data`:`format` | `pandas` | `pandas`, `numpy`, `cudf` | Data format to use in the benchmark. |
| `data`:`order` | `F` | `C`, `F` | Data order to use in the benchmark: C-contiguous or Fortran. |
| `data`:`dtype` | `float64` | | Data type to use in the benchmark. |
| `data`:`distributed_split` | None | None, `rank_based` | Split type used to distribute data between machines in the distributed algorithm. `None` means all machines use all of the data without a split. `rank_based` splits the data equally between machines, with the split sequence based on the MPI rank id. |
|<h3>Algorithm parameters</h3>||||
| `algorithm`:`library` | None | | Python module containing the measured entity (class or function). |
| `algorithm`:`device` | `default` | `default`, `cpu`, `gpu` | Device selected for computation. |
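
For instance, a parameters set might combine `bench` and `data` keys from the table above; the concrete values below are illustrative only:
```json
{
    "PARAMETERS_SETS": {
        "common workflow and data": {
            "bench": {
                "n_runs": 20,
                "time_limit": 600
            },
            "data": {
                "format": "numpy",
                "order": "C",
                "dtype": "float32",
                "split_kwargs": { "test_size": 0.2 }
            }
        }
    }
}
```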

## Benchmark-Specific Parameters

### `Scikit-learn Estimator`

| Parameter keys | Default value | Choices | Description |
|:---------------|:--------------|:--------|:------------|
| `algorithm`:`estimator` | None | | Name of the measured estimator. |
| `algorithm`:`estimator_params` | Empty `dict` | | Parameters for the estimator constructor. |
| `algorithm`:`batch_size`:`{stage}` | None | Any positive integer | Enables online mode for `{stage}` methods of the estimator (sequential calls for each batch). |
| `algorithm`:`sklearn_context` | None | | Parameters for the sklearn `config_context` used over the estimator. |
| `algorithm`:`sklearnex_context` | None | | Parameters for the sklearnex `config_context` used over the estimator. Updated by `sklearn_context` if set. |
| `bench`:`ensure_sklearnex_patching` | True | | If True, warns about sklearnex patching failures. |
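
A minimal sketch of an estimator benchmark configuration using these keys; the estimator, its parameters, and the `predict` stage name are illustrative assumptions rather than required values:
```json
{
    "algorithm": {
        "estimator": "KMeans",
        "estimator_params": { "n_clusters": 10, "max_iter": 50 },
        "batch_size": { "predict": 1000 },
        "sklearn_context": { "assume_finite": true }
    }
}
```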

### `Function`

| Parameter keys | Default value | Choices | Description |
|:---------------|:--------------|:--------|:------------|
| `algorithm`:`function` | None | | Name of the measured function. |
| `algorithm`:`args_order` | `x_train\|y_train` | Any in format `{subset_0}\|..\|{subset_n}` | Order of positional arguments for the measured function. |
| `algorithm`:`kwargs` | Empty `dict` | | Named arguments for the measured function. |
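
A similar sketch for a function benchmark; the library, function, and keyword arguments here are illustrative choices only:
```json
{
    "algorithm": {
        "library": "sklearn.metrics",
        "function": "pairwise_distances",
        "args_order": "x_train",
        "kwargs": { "metric": "euclidean" }
    }
}
```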

## Special Value

You can define some parameters as values derived from other parameters or properties by using the `[SPECIAL_VALUE]` prefix in a string value:
```json
... "estimator_params": { "n_jobs": "[SPECIAL_VALUE]physical_cpus" } ...
... "generation_kwargs": { "n_informative": "[SPECIAL_VALUE]0.5" } ...
```

List of available special values:

| Parameter keys | Benchmark type[s] | Special value | Description |
|:---------------|:------------------|:--------------|:------------|
| `data`:`dataset` | all | `all_named` | Sets the datasets to use as the list of all named datasets available in loaders. |
| `data`:`generation_kwargs`:`n_informative` | all | *float* value in [0, 1] range | Sets `n_informative` for synthetic data generation as a share of the number of generated features. |
| `bench`:`taskset` | all | Specification of NUMA nodes in `numa:{numa_node_0}[\|{numa_node_1}...]` format | Sets CPU affinity using the `taskset` utility. |
| `algorithm`:`estimator_params`:`n_jobs` | sklearn_estimator | `physical_cpus`, `logical_cpus`, or a ratio of them in `{type}_cpus:{ratio}` format where `ratio` is a *float* | Sets the `n_jobs` parameter to the number of physical/logical CPUs, or a ratio of them, for the estimator. |
| `algorithm`:`estimator_params`:`scale_pos_weight` | sklearn_estimator | `auto` | Sets the `scale_pos_weight` parameter to `sum(negative instances) / sum(positive instances)` for the estimator. |
| `algorithm`:`estimator_params`:`n_clusters` | sklearn_estimator | `auto` | Sets the `n_clusters` parameter to the number of clusters or classes from the dataset description for the estimator. |
| `algorithm`:`estimator_params`:`eps` | sklearn_estimator | `distances_quantile:{quantile}` format where `quantile` is a *float* value in [0, 1] range | Computes the `eps` parameter as the quantile value of distances in the `x_train` matrix for the estimator. |
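
For instance, several special values from the table above can be combined in one parameters set; the estimator choice and the concrete ratio and quantile values below are illustrative only:
```json
{
    "bench": { "taskset": "[SPECIAL_VALUE]numa:0" },
    "algorithm": {
        "estimator": "DBSCAN",
        "estimator_params": {
            "n_jobs": "[SPECIAL_VALUE]physical_cpus:0.5",
            "eps": "[SPECIAL_VALUE]distances_quantile:0.01"
        }
    }
}
```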

## Range of Values

You can define some parameters as a range of values with the `[RANGE]` prefix in a string value:
```json
... "generation_kwargs": {"n_features": "[RANGE]pow:2:5:6"} ...
```

Supported ranges:

 - `add:start{int}:end{int}:step{int}` - Arithmetic progression (sequence of `start + step * i` values while they are `<= end`)
 - `mul:current{int}:end{int}:step{int}` - Geometric progression (sequence starting at `current` and multiplied by `step` while the value is `<= end`)
 - `pow:base{int}:start{int}:end{int}[:step{int}=1]` - Powers of the base number with exponents from `start` to `end`
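
As an illustration of how these ranges expand, assuming the inclusive bounds described above (the exact expansion is determined by the benchmark runner):
```
"[RANGE]add:10:50:20"  ->  10, 30, 50
"[RANGE]mul:2:16:2"    ->  2, 4, 8, 16
"[RANGE]pow:2:5:6"     ->  32, 64
```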

## Removal of Values

You can remove a specific parameter from a subset of cases when stacking parameter sets by using the `[REMOVE]` parameter value:

```json
... "estimator_params": { "n_jobs": "[REMOVE]" } ...
```
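
For example, a template can reuse a parameter set that defines `n_jobs` but drop that parameter for one specific setup; the set and template names here are illustrative:
```json
{
    "PARAMETERS_SETS": {
        "common estimator parameters": {
            "algorithm": {
                "estimator_params": { "n_jobs": "[SPECIAL_VALUE]physical_cpus" }
            }
        }
    },
    "TEMPLATES": {
        "setup without n_jobs": {
            "SETS": ["common estimator parameters"],
            "algorithm": { "estimator_params": { "n_jobs": "[REMOVE]" } }
        }
    }
}
```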

---
[Documentation tree](../README.md#-documentation)