
Research: How to integrate VITA 1.5 for multi-modal GGUF deployment? #13520

Open
@jordanqi

Description

Research Stage

  • Background Research (Let's try to avoid reinventing the wheel)
  • Hypothesis Formed (How do you think this will work and what its effect will be?)
  • Strategy / Implementation Forming
  • Analysis of results
  • Debrief / Documentation (So people in the future can learn from us)

Previous existing literature and research

I'm trying to deploy a multi-modal model based on VITA-1.5, where:

The text backbone is the same as Qwen2.

The vision tower is InternViT-300M-448px from OpenGVLab.
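For context, here is a quick sketch of how I confirm that composition from the HF checkpoint itself. The repo id VITA-MLLM/VITA-1.5 and the exact config attribute names are assumptions on my part; VITA ships custom modeling code, hence trust_remote_code.

```python
# Sketch: inspect the HF config to confirm the text/vision composition.
# Assumptions: the checkpoint is published as "VITA-MLLM/VITA-1.5" with a
# standard config.json; attribute names below are illustrative guesses.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("VITA-MLLM/VITA-1.5", trust_remote_code=True)
print(type(cfg).__name__)
print(getattr(cfg, "model_type", None))        # expected: Qwen2-based text backbone
print(getattr(cfg, "vision_config", None)
      or getattr(cfg, "mm_vision_tower", None))  # expected: InternViT-300M-448px
```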

Yesterday I noticed that convert_hf_to_gguf.py added a new class:

class InternVisionModel(VisionModel)

which is the same vision tower that VITA-1.5 uses. However:

  • There's no corresponding tensor-name mapping in constants.py under MODEL_TENSORS.
  • There's no build function in llama_model.cpp (e.g., no build_internvit()).

I'm not sure how to combine the vision and text parts into a single GGUF model so that llama.cpp can run inference with both modalities.
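As a starting point for working out the missing mapping, here is a small sketch that just groups the HF tensor names by prefix, so the vision-tower tensors that would need entries in constants.py / the tensor-name map stand out from the Qwen2 text tensors. It only uses the safetensors API; the local checkpoint path and the prefix depth are assumptions.

```python
# Sketch: group HF tensor names by prefix so the vision-tower tensors that need
# GGUF name mappings are visible next to the Qwen2 text tensors.
# Assumption: the VITA-1.5 weights are downloaded locally as *.safetensors shards.
from collections import Counter
from pathlib import Path

from safetensors import safe_open

ckpt_dir = Path("./VITA-1.5")   # hypothetical local checkpoint directory
prefixes: Counter[str] = Counter()

for shard in sorted(ckpt_dir.glob("*.safetensors")):
    with safe_open(str(shard), framework="pt") as f:
        for name in f.keys():
            # group by the first two dotted components, e.g. "vision_tower.encoder"
            prefixes[".".join(name.split(".")[:2])] += 1

for prefix, count in prefixes.most_common():
    print(f"{count:5d}  {prefix}")
```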

My goal:
To deploy VITA-1.5 via llama.cpp and run image+text inference (similar to LLaVA / MobileVLM).

Questions:
What is the recommended way to combine Qwen2 text + InternViT vision into one GGUF model?

Will InternVisionModel get GGUF inference support soon, or should I write the corresponding GGML graph manually?
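For reference, the LLaVA / MobileVLM pattern I mentioned currently uses two GGUF files rather than one: the text model plus an mmproj file for the vision encoder/projector. Below is a rough sketch of how that pattern is driven from Python through the llama-cpp-python bindings; the file names are placeholders, and whether the LLaVA-1.5 chat handler fits VITA-1.5's projector at all is exactly what I'm unsure about, so treat it as an illustration of the flow, not a working recipe.

```python
# Sketch of the LLaVA-style two-file pattern (text GGUF + mmproj GGUF) via the
# llama-cpp-python bindings. File names are placeholders; Llava15ChatHandler is
# what the existing LLaVA models use and may not match VITA-1.5's projector.
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

chat_handler = Llava15ChatHandler(clip_model_path="vita-mmproj.gguf")  # placeholder
llm = Llama(
    model_path="vita-qwen2-text.gguf",  # placeholder for the Qwen2 text backbone
    chat_handler=chat_handler,
    n_ctx=4096,
)

out = llm.create_chat_completion(
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "file:///path/to/image.png"}},
            {"type": "text", "text": "Describe this image."},
        ],
    }]
)
print(out["choices"][0]["message"]["content"])
```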

Hypothesis

No response

Implementation

No response

Analysis

No response

Relevant log output
