mtmd : refactor llava-uhd preprocessing logic #14247

ngxson · 2025-06-17T17:37:18Z

This code block may require even more refactoring in the future, but I will do that when I have more time to spend.

Explanation: Most vision models only support a fixed (square) input image size, but in reality images come in different sizes and aspect ratios.

The idea here is to split a big image into smaller "slices" using a heuristic algorithm, so that each slice is in the "native" resolution that the vision encoder can process. So I actually expect the code to be super simple to understand. However, the current code looks quite complicated.
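
To make the slicing idea concrete, here is a minimal sketch of cutting a canvas into square slices of the encoder's native size. It assumes the image has already been resized so both sides are exact multiples of the slice size. The names `slice_rect` and `make_slices` are hypothetical and are not taken from the PR; the real code also has to choose the canvas size and handle model-specific details.

```cpp
#include <cassert>
#include <vector>

// Hypothetical rectangle type for this sketch (not a type from the PR).
struct slice_rect {
    int x;
    int y;
    int width;
    int height;
};

// Split a canvas into square slices of side `slice_size`, in row-major order.
// Assumption: both canvas sides are exact multiples of `slice_size`.
static std::vector<slice_rect> make_slices(int canvas_w, int canvas_h, int slice_size) {
    assert(slice_size > 0);
    assert(canvas_w % slice_size == 0 && canvas_h % slice_size == 0);

    std::vector<slice_rect> slices;
    for (int y = 0; y < canvas_h; y += slice_size) {
        for (int x = 0; x < canvas_w; x += slice_size) {
            slices.push_back({x, y, slice_size, slice_size});
        }
    }
    return slices;
}
```

For example, a 200x100 canvas with a slice size of 100 yields two slices, at (0, 0) and (100, 0).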

Selects the best resolution from a list of possible resolutions based on the original size.

For example, when given a list of resolutions:

- 100x100
- 200x100
- 100x200
- 200x200

And an input image of size 111x200, then 100x200 is the best fit (least wasted resolution).
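
The example above can be reproduced with a small sketch that uses only the "least wasted resolution" criterion: scale the image to fit inside each candidate while keeping its aspect ratio, then pick the candidate with the smallest unused area. This is a simplified illustration, not the function from the PR; the names `image_size`, `wasted_area` and `pick_least_wasted` are hypothetical, and the real implementation may weigh other factors as well.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical size type for this sketch (not the struct used in the PR).
struct image_size {
    int width;
    int height;
};

// Area of `cand` left unused after scaling `img` to fit inside it,
// keeping the aspect ratio. Integer math only, so results are exact.
static int64_t wasted_area(const image_size & img, const image_size & cand) {
    int64_t fit_w;
    int64_t fit_h;
    // Compare cand.width / img.width with cand.height / img.height
    // by cross-multiplying, which avoids floating point.
    if ((int64_t) cand.width * img.height <= (int64_t) cand.height * img.width) {
        // Width is the limiting side.
        fit_w = cand.width;
        fit_h = (int64_t) img.height * cand.width / img.width;
    } else {
        // Height is the limiting side.
        fit_h = cand.height;
        fit_w = (int64_t) img.width * cand.height / img.height;
    }
    return (int64_t) cand.width * cand.height - fit_w * fit_h;
}

// Return the candidate with the least wasted area; the first one wins on ties.
static image_size pick_least_wasted(const image_size & img,
                                    const std::vector<image_size> & candidates) {
    assert(!candidates.empty());
    assert(img.width > 0 && img.height > 0);

    image_size best       = candidates.front();
    int64_t    best_waste = std::numeric_limits<int64_t>::max();
    for (const image_size & cand : candidates) {
        const int64_t waste = wasted_area(img, cand);
        if (waste < best_waste) {
            best_waste = waste;
            best       = cand;
        }
    }
    return best;
}
```

With the 111x200 input, the wasted areas under this criterion are 4500 (100x100), 14500 (200x100), 2000 (100x200) and 17800 (200x200), so 100x200 is selected.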

Test result:

[vision] OK:   llama-mtmd-cli ggml-org/SmolVLM-500M-Instruct-GGUF:Q8_0
[vision] OK:   llama-mtmd-cli ggml-org/SmolVLM2-2.2B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/SmolVLM2-500M-Video-Instruct-GGUF:Q8_0
[vision] OK:   llama-mtmd-cli ggml-org/gemma-3-4b-it-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli THUDM/glm-edge-v-5b-gguf:Q4_K_M
[vision] OK:   llama-mtmd-cli second-state/Llava-v1.5-7B-GGUF:Q2_K
[vision] OK:   llama-mtmd-cli cjpais/llava-1.6-mistral-7b-gguf:Q3_K_M
[vision] OK:   llama-mtmd-cli ibm-research/granite-vision-3.2-2b-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli second-state/MiniCPM-Llama3-V-2_5-GGUF:Q2_K
[vision] OK:   llama-mtmd-cli openbmb/MiniCPM-V-2_6-gguf:Q2_K
[vision] OK:   llama-mtmd-cli openbmb/MiniCPM-o-2_6-gguf:Q4_0
[vision] OK:   llama-mtmd-cli bartowski/Qwen2-VL-2B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/Qwen2.5-VL-3B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/InternVL2_5-1B-GGUF:Q8_0
[vision] OK:   llama-mtmd-cli ggml-org/InternVL3-1B-Instruct-GGUF:Q8_0
[vision] OK:   llama-mtmd-cli ggml-org/Qwen2.5-Omni-3B-GGUF:Q4_K_M
[audio]  OK:   llama-mtmd-cli ggml-org/ultravox-v0_5-llama-3_2-1b-GGUF:Q8_0
[audio]  OK:   llama-mtmd-cli ggml-org/Qwen2.5-Omni-3B-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/pixtral-12b-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/Mistral-Small-3.1-24B-Instruct-2503-GGUF
[vision] OK:   llama-mtmd-cli ggml-org/Qwen2-VL-2B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/Qwen2-VL-7B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/Qwen2.5-VL-3B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/Qwen2.5-VL-7B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/InternVL3-8B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/InternVL3-14B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/Qwen2.5-Omni-7B-GGUF:Q4_K_M
[audio]  OK:   llama-mtmd-cli ggml-org/ultravox-v0_5-llama-3_1-8b-GGUF:Q4_K_M
[audio]  OK:   llama-mtmd-cli ggml-org/Qwen2.5-Omni-7B-GGUF:Q4_K_M

ngxson · 2025-06-18T07:53:40Z

Confirmed to fix the issue: #13827 (comment)

mtmd : refactor llava-uhd preprocessing logic

d919c31

github-actions bot added the examples label Jun 17, 2025

fix editorconfig

4aa20cd

ngxson mentioned this pull request Jun 17, 2025

Eval bug: Llama 4 Scout/Maverick crash when processing images with certain aspect ratio #13827

Closed

ngxson marked this pull request as ready for review June 18, 2025 07:53

ngxson requested a review from ggerganov June 18, 2025 07:53

ggerganov approved these changes Jun 18, 2025


ngxson merged commit 413977d into ggml-org:master Jun 18, 2025
47 checks passed
