MLX VLM training broken for Qwen3-VL: processor gets 4D channels-last (1,H,W,3), expects 3D (C,H,W); also 512px hard cap #6129

@Hassan-A-K

Bug

MLX VLM LoRA training fails on Qwen3-VL (Apple Silicon). The MLX data path in unsloth_zoo feeds the mlx_vlm Qwen3-VL image processor a 4-D channels-last array, but the processor expects a 3-D channels-first image, so every training step dies at:

File ".../mlx_vlm/models/qwen3_vl/processing_qwen3_vl.py", line 193, in _process_one
    C, H, W = image.shape
ValueError: too many values to unpack (expected 3)

A trace of what actually reaches _process_one shows the image arriving with shape=(1, 512, 512, 3), i.e. (batch, H, W, C): channels-last with an extra batch dimension. _process_one, however, unpacks C, H, W = image.shape, which assumes channels-first with no batch dimension.
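
The failure itself needs no model to reproduce, since it is a plain tuple-unpacking error on the array's shape. A minimal sketch using only the shape reported in the trace (the variable names are illustrative and not taken from mlx_vlm):

```python
# Shape observed reaching _process_one: (batch, H, W, C).
observed_shape = (1, 512, 512, 3)

try:
    # Same unpacking pattern as the failing line: it needs exactly 3 values.
    C, H, W = observed_shape
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# Prints the "too many values to unpack" message seen in the traceback.
print(error_message)
```

Any 4-D input triggers this regardless of channel order, so dropping the batch dimension alone would stop the crash but still leave H, W, C bound to the wrong names.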

Two issues:

  1. Shape/layout mismatch: unsloth_zoo passes (1, H, W, C); mlx_vlm's Qwen3-VL processor wants (C, H, W).
  2. Silent 512px resize: the image is 512x512 regardless of the source resolution — the MLX collate's image_size default appears hard-set to 512, with no field exposed in MLXTrainingConfig to raise it. (High-resolution document fine-tuning needs e.g. ~2-6 MP.)
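
To put the second issue in numbers, the following back-of-the-envelope sketch compares the 988x2560 image used in the reproduction with the 512x512 array that reaches the processor. The helper name megapixels is made up for this illustration.

```python
def megapixels(width: int, height: int) -> float:
    """Pixel count in millions."""
    return width * height / 1_000_000

source_mp = megapixels(988, 2560)   # source image from the reproduction
capped_mp = megapixels(512, 512)    # what arrives at the processor
retained = capped_mp / source_mp

print(f"{source_mp:.2f} MP -> {capped_mp:.2f} MP ({retained:.1%} of pixels kept)")
# 2.53 MP -> 0.26 MP (10.4% of pixels kept)
```

Roughly 90% of the pixels are discarded before the model sees the page, which is well below the ~2-6 MP range mentioned above.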

Not version-specific: reproduced identically on mlx-vlm 0.5.0, 0.6.0, 0.6.1, 0.6.2 (all versions that ship models/qwen3_vl).

The model loads and get_peft_model works fine (Qwen3-VL-8B 4-bit + vision LoRA loads in ~34 GB on a 64 GB M4 Max) — only the training-step data collation is broken.

Minimal reproducible example

from unsloth import FastMLXModel, MLXTrainer, MLXTrainingConfig
from PIL import Image

model, processor = FastMLXModel.from_pretrained(
    "unsloth/Qwen3-VL-8B-Instruct", load_in_4bit=True,
    text_only=False, max_seq_length=4096)
model = FastMLXModel.get_peft_model(
    model, r=8, finetune_vision_layers=True, use_gradient_checkpointing=True)

img = Image.new("RGB", (988, 2560))          # any image
ds = [{"messages": [
    {"role": "user", "content": [
        {"type": "image", "image": img},
        {"type": "text", "text": "Extract the fields."}]},
    {"role": "assistant", "content": [{"type": "text", "text": "{}"}]}]}]

trainer = MLXTrainer(
    model=model, processor=processor,
    tokenizer=getattr(processor, "tokenizer", processor),
    train_dataset=ds,
    args=MLXTrainingConfig(max_steps=1, output_dir="/tmp/o",
                           per_device_train_batch_size=1))
trainer.train()   # -> ValueError: too many values to unpack (expected 3)

Environment

  • macOS, Apple Silicon (M4 Max, 64 GB)
  • unsloth 2026.6.1, unsloth_zoo 2026.6.1
  • mlx 0.31.2, mlx-lm 0.31.3, mlx-vlm 0.6.2 (also 0.6.1 / 0.6.0 / 0.5.0)
  • transformers 5.5.0, Python 3.11

Possibly related

Whatever produces the collated image array for the VLM MLX path needs to (a) drop the batch dim / pass per-image, (b) match mlx_vlm's channels-first (C, H, W) expectation for Qwen3-VL, and (c) expose an image-resolution knob (currently pinned at 512px).
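
Points (a) and (b) can be sketched in a few lines of NumPy. This is only an illustration of the layout conversion, not a patch against unsloth_zoo: the helper to_channels_first is hypothetical, and it assumes the collated image is a NumPy-compatible array of shape (1, H, W, C) or (H, W, C).

```python
import numpy as np

def to_channels_first(image: np.ndarray) -> np.ndarray:
    """Hypothetical helper: normalize a collated image to (C, H, W).

    Assumes input of shape (1, H, W, C) or (H, W, C) with C in {1, 3, 4}.
    """
    if image.ndim == 4:
        if image.shape[0] != 1:
            raise ValueError(f"expected a single image, got batch of {image.shape[0]}")
        image = image[0]                      # (a) drop the batch dim
    if image.ndim != 3:
        raise ValueError(f"expected 3 or 4 dims, got shape {image.shape}")
    if image.shape[-1] not in (1, 3, 4):
        raise ValueError(f"expected channels-last input, got shape {image.shape}")
    return np.transpose(image, (2, 0, 1))     # (b) (H, W, C) -> (C, H, W)

collated = np.zeros((1, 512, 512, 3), dtype=np.float32)
fixed = to_channels_first(collated)
C, H, W = fixed.shape                         # the unpacking now succeeds
print(fixed.shape)  # (3, 512, 512)
```

Point (c) is not sketched here because it depends on where image_size is set inside the MLX collate, which this report has not pinned down.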
