mtmd : refactor llava-uhd preprocessing logic #14247
Merged
+111
−81
Fix #13827
This code block may require even more refactoring in the future, but I will do it when I have more time to spend.
Explanation: Most vision models only support a fixed (square) input image size, but in reality images come in different sizes and aspect ratios.
The idea here is to split a big image into smaller "slices" using a heuristic algorithm, so that each slice is at the "native" resolution that the vision encoder can process. So I actually expect the code to be super simple to understand. However, the current code looks quite complicated.
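To make the slicing idea concrete, here is a minimal sketch of one possible heuristic. It is not the code from this PR. It assumes a llava-uhd-style rule: estimate how many native-size slices cover the image area, then pick the `cols x rows` grid whose aspect ratio is closest to the image's. The names `image_size`, `slice_grid`, `choose_grid`, `slice_size` and `max_slices` are hypothetical and exist only for illustration.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>

// Hypothetical types for illustration; not the structs used in this PR.
struct image_size { int width; int height; };
struct slice_grid { int cols; int rows; };

// Pick a grid of slices whose aspect ratio is closest to the image's.
// slice_size is the encoder's native (square) input size in pixels;
// max_slices caps the total number of slices and must be >= 1.
slice_grid choose_grid(image_size img, int slice_size, int max_slices) {
    // How many native-size slices are needed to cover the image area.
    const double area_ratio =
        double(img.width) * img.height / (double(slice_size) * slice_size);
    const int ideal = std::clamp((int) std::ceil(area_ratio), 1, max_slices);

    // Compare aspect ratios in log space so that 2:1 and 1:2 are equally far from 1:1.
    const double log_ratio = std::log(double(img.width) / img.height);

    slice_grid best = {1, 1};
    double best_err = std::numeric_limits<double>::infinity();

    // Try slice counts around the ideal one, and every way to factor each
    // count into cols x rows.
    for (int n = std::max(1, ideal - 1); n <= std::min(max_slices, ideal + 1); n++) {
        for (int cols = 1; cols <= n; cols++) {
            if (n % cols != 0) {
                continue;
            }
            const int rows = n / cols;
            const double err = std::abs(log_ratio - std::log(double(cols) / rows));
            if (err < best_err) {
                best_err = err;
                best = {cols, rows};
            }
        }
    }
    return best;
}
```

With these assumptions, a 1344x448 image and a slice size of 448 map to a 3x1 grid, while an 896x896 image maps to 2x2. Each grid cell would then be resized to the native resolution before being passed to the encoder; that step is omitted here.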
Test result: