Description
Research Stage
- Background Research (Let's try to avoid reinventing the wheel)
- Hypothesis Formed (How do you think this will work and what will its effect be?)
- Strategy / Implementation Forming
- Analysis of results
- Debrief / Documentation (So people in the future can learn from us)
Previous existing literature and research
I'm trying to deploy a multi-modal model based on VITA-1.5, where:
- The text backbone is the same as Qwen2.
- The vision tower is InternViT-300M-448px from OpenGVLab.
(A quick config sanity check of this composition is sketched just below.)
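For that sanity check, the sub-configs of a local checkpoint can be inspected; this is a minimal sketch, assuming the Hugging Face repo id and the sub-config attribute names below, none of which I verified against the actual VITA-1.5 files:

```python
# Hypothetical sanity check: print the text/vision sub-configs of a VITA-1.5
# checkpoint. The repo id and the attribute names probed here are assumptions.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("VITA-MLLM/VITA-1.5", trust_remote_code=True)
print("top-level model_type:", getattr(cfg, "model_type", None))

# Multimodal configs usually expose the LLM and the vision tower as sub-configs;
# the exact attribute names may differ in the real repo.
for name in ("text_config", "llm_config", "vision_config"):
    sub = getattr(cfg, name, None)
    if sub is not None:
        print(f"{name}: {getattr(sub, 'model_type', type(sub).__name__)}")
```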
Yesterday I noticed that convert_hf_to_gguf.py added a new class:
class InternVisionModel(VisionModel)
which is the same architecture that VITA's vision part uses.
However:
- There's no corresponding tensor name mapping in constants.py under MODEL_TENSORS.
- There's no build function in llama_model.cpp (e.g., no build_internvit()).
I'm not sure how to combine the vision and text parts into a single GGUF model so that llama.cpp can run inference with both modalities.
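For context on the missing mapping, here is a rough, standalone sketch of the kind of HF-to-GGUF tensor-name translation I mean. Both the InternViT-side prefixes and the "v.blk.*" GGUF-side names are placeholders chosen for illustration, not the identifiers that gguf-py or the conversion script actually define:

```python
import re

# Illustrative only: the rough shape of an HF -> GGUF tensor-name mapping for an
# InternViT-style vision tower. All names on both sides are placeholders, not the
# real llama.cpp / gguf-py definitions.
PER_BLOCK_MAP = {
    "attn.qkv":  "attn_qkv",
    "attn.proj": "attn_out",
    "mlp.fc1":   "ffn_up",
    "mlp.fc2":   "ffn_down",
    "norm1":     "ln1",
    "norm2":     "ln2",
}

def map_internvit_name(hf_name: str) -> str | None:
    """Map one HF tensor name to a GGUF-style name; return None if unrecognized."""
    # Embedding tensors (no block index).
    if hf_name.startswith("vision_model.embeddings."):
        return hf_name.replace("vision_model.embeddings.", "v.")
    # Per-block encoder tensors: vision_model.encoder.layers.<bid>.<sub>.<weight|bias>
    m = re.match(r"vision_model\.encoder\.layers\.(\d+)\.(.+)\.(weight|bias)$", hf_name)
    if m:
        bid, sub, kind = m.groups()
        if sub in PER_BLOCK_MAP:
            return f"v.blk.{bid}.{PER_BLOCK_MAP[sub]}.{kind}"
    return None

# Example:
# map_internvit_name("vision_model.encoder.layers.0.attn.qkv.weight")
#   -> "v.blk.0.attn_qkv.weight"
```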
My goal:
To deploy VITA-1.5 via llama.cpp and run image+text inference (similar to LLaVA / MobileVLM).
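For comparison, the LLaVA / MobileVLM deployments I have in mind split the weights into a text-model GGUF plus a separate vision/projector GGUF rather than one combined file. Below is a minimal sketch of that two-file pattern through the llama-cpp-python bindings; the file paths are placeholders, and the LLaVA-1.5 chat handler is only an analogy, since VITA-1.5 is not actually wired up to it:

```python
# Sketch of the LLaVA-style two-file pattern via the llama-cpp-python bindings.
# File paths are placeholders; VITA-1.5 itself is NOT supported by this handler.
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

chat_handler = Llava15ChatHandler(clip_model_path="mmproj-model-f16.gguf")
llm = Llama(
    model_path="text-model-q4_k_m.gguf",  # Qwen2-style text backbone
    chat_handler=chat_handler,
    n_ctx=4096,
)

out = llm.create_chat_completion(
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "file:///tmp/example.png"}},
            {"type": "text", "text": "Describe this image."},
        ],
    }]
)
print(out["choices"][0]["message"]["content"])
```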
Questions:
- What is the recommended way to combine Qwen2 text + InternViT vision into one GGUF model?
- Will InternVisionModel support GGUF inference soon, or should I write the corresponding GGML graph manually?
Hypothesis
No response
Implementation
No response
Analysis
No response