Differential mode for llama-bench + plotting code #13408

@JohannesGaessler

Description

I think it would be useful if there were a way to more easily compare llama-bench results as a function of context size, and I would therefore like to implement such a feature. What I'm imagining is a --differential flag which, when set, reports separate numbers for each individual model evaluation in a benchmark run instead of a single aggregate number for all evaluations.

So for example with ./llama-bench -r 1 -d 1024 -n 4 -p 64 -ub 16 --differential I'm imagining something like this:

| model         | size       | params     | backend    | ngl | n_ubatch | test            | t/s                  |
| ------------- | ---------: | ---------: | ---------- | --: | -------: | --------------: | -------------------: |
| llama 8B Q4_0 | 4.33 GiB   | 8.03 B     | CUDA       |  99 |       16 | pp16 @ d1024    | 1115.41 ± 0.00       |
| llama 8B Q4_0 | 4.33 GiB   | 8.03 B     | CUDA       |  99 |       16 | pp16 @ d1040    | 1115.41 ± 0.00       |
| llama 8B Q4_0 | 4.33 GiB   | 8.03 B     | CUDA       |  99 |       16 | pp16 @ d1056    | 1115.41 ± 0.00       |
| llama 8B Q4_0 | 4.33 GiB   | 8.03 B     | CUDA       |  99 |       16 | pp16 @ d1072    | 1115.41 ± 0.00       |
| llama 8B Q4_0 | 4.33 GiB   | 8.03 B     | CUDA       |  99 |       16 | tg1 @ d1024     | 115.22 ± 0.00        |
| llama 8B Q4_0 | 4.33 GiB   | 8.03 B     | CUDA       |  99 |       16 | tg1 @ d1025     | 115.22 ± 0.00        |
| llama 8B Q4_0 | 4.33 GiB   | 8.03 B     | CUDA       |  99 |       16 | tg1 @ d1026     | 115.22 ± 0.00        |
| llama 8B Q4_0 | 4.33 GiB   | 8.03 B     | CUDA       |  99 |       16 | tg1 @ d1027     | 115.22 ± 0.00        |

You could in principle already achieve something like this by invoking llama-bench multiple times, but that is inconvenient.

Because reading differential data from a table is difficult, I would also add code to plot t/s as a function of depth using matplotlib. I would add plotting code to compare-llama-bench.py, but because people often want just the performance for a single commit, I would also add a simplified plotting script that reads one or more CSV tables and plots the contents of all tables in a single figure.
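A simplified plotting script along these lines could look roughly like the sketch below. To be clear, this is only an illustration, not the proposed implementation: the CSV column names `test` and `avg_ts` and the label format `pp16 @ d1024` are assumptions based on the example table above, not the actual llama-bench CSV schema.

```python
# Hypothetical sketch: read one or more llama-bench CSV tables and plot
# t/s as a function of depth, one line per (file, test kind) pair.
# The column names "test"/"avg_ts" and the "pp16 @ d1024" label format
# are assumptions for illustration only.
import csv
import re
import sys


def parse_rows(path):
    """Return {test_kind: [(depth, t_s), ...]} for one CSV table."""
    series = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            # Expect test labels like "pp16 @ d1024" or "tg1 @ d1025".
            m = re.match(r"(\w+) @ d(\d+)", row["test"])
            if not m:
                continue
            kind, depth = m.group(1), int(m.group(2))
            series.setdefault(kind, []).append((depth, float(row["avg_ts"])))
    return series


def plot_files(paths):
    """Plot all tables in a single figure, as described above."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    for path in paths:
        for kind, points in sorted(parse_rows(path).items()):
            points.sort()
            depths, ts = zip(*points)
            ax.plot(depths, ts, marker="o", label=f"{path}: {kind}")
    ax.set_xlabel("depth (context tokens)")
    ax.set_ylabel("t/s")
    ax.legend()
    plt.show()


if __name__ == "__main__":
    plot_files(sys.argv[1:])
```

Usage would be something like `python plot-llama-bench.py a.csv b.csv`, with each CSV produced by a separate llama-bench invocation.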

It would in principle also be possible to fit a polynomial to the runtime as a function of depth, but doing this correctly is non-trivial. It could be done with comparatively little effort using kafe2, but that project is licensed under the GPLv3. In any case, I don't think a statistical analysis of the performance differences (and how they would extrapolate to higher context sizes) will be needed, so it makes more sense to keep it simple and just produce a plot.

@slaren since you are probably the biggest stakeholder for llama-bench, does the feature as described here sound useful to you? Do you have suggestions for changes?
