Bug
MLX VLM LoRA training fails on Qwen3-VL (Apple Silicon). The MLX data path in unsloth_zoo feeds the mlx_vlm Qwen3-VL image processor a 4-D channels-last array, but the processor expects a 3-D channels-first image, so every training step dies at:
  File ".../mlx_vlm/models/qwen3_vl/processing_qwen3_vl.py", line 193, in _process_one
    C, H, W = image.shape
ValueError: too many values to unpack (expected 3)
A trace of what actually reaches _process_one shows the image arriving as shape=(1, 512, 512, 3) — i.e. (batch, H, W, C), channels-last with an extra batch dim — while _process_one does C, H, W = image.shape (channels-first, no batch).
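The failure can be reproduced in isolation, without MLX or a model, because it is ordinary tuple unpacking. A minimal sketch using only the shape reported in the trace (the variable names are illustrative, and the exact wording of the message can vary between Python versions):

```python
# Shape observed arriving at _process_one, per the trace above.
shape = (1, 512, 512, 3)  # (batch, H, W, C)

# Unpacking four values into three names raises ValueError.
try:
    C, H, W = shape
    message = None
except ValueError as exc:
    message = str(exc)

# Unpacking with a name for the batch dimension succeeds.
B, H, W, C = shape
print(message)
print(B, H, W, C)
```

This only confirms that the error comes from the array's rank and says nothing about where in the collate path the extra dimension is added.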
Two issues:
- Shape/layout mismatch: unsloth_zoo passes (1, H, W, C); mlx_vlm's Qwen3-VL processor wants (C, H, W).
- Silent 512px resize: the image is 512x512 regardless of the source resolution. The MLX collate's image_size default appears hard-set to 512, with no field exposed in MLXTrainingConfig to raise it. (High-resolution document fine-tuning needs e.g. ~2-6 MP.)
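For scale, the 988x2560 test image used in the reproduction holds about 2.5 MP, while a 512x512 array holds about 0.26 MP. The sketch below does that arithmetic and shows one way a resolution cap could be expressed as a pixel budget instead of a fixed side length. `fit_to_pixel_budget` and its `max_pixels` parameter are hypothetical names for illustration, not part of unsloth_zoo or mlx_vlm, and a real processor may also need the dimensions rounded to its patch size.

```python
import math


def fit_to_pixel_budget(width, height, max_pixels):
    """Hypothetical helper: shrink (width, height) to fit within max_pixels.

    Keeps the aspect ratio (up to integer rounding) and never upscales.
    """
    if width <= 0 or height <= 0 or max_pixels <= 0:
        raise ValueError("width, height and max_pixels must be positive")
    pixels = width * height
    if pixels <= max_pixels:
        return width, height
    scale = math.sqrt(max_pixels / pixels)
    # Floor so the result stays within the budget; clamp to at least 1 px.
    return max(1, math.floor(width * scale)), max(1, math.floor(height * scale))


src_w, src_h = 988, 2560
src_pixels = src_w * src_h            # pixels in the source image
fixed_pixels = 512 * 512              # pixels after the fixed 512px resize
retained = fixed_pixels / src_pixels  # fraction of source pixels kept

new_w, new_h = fit_to_pixel_budget(src_w, src_h, max_pixels=2_000_000)
print(f"retained at 512x512: {retained:.1%}")
print(f"2 MP budget: {new_w}x{new_h}")
```

With the fixed 512x512 size, roughly a tenth of the source pixels survive, which is why a configurable limit matters for document images.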
Not version-specific: reproduced identically on mlx-vlm 0.5.0, 0.6.0, 0.6.1, 0.6.2 (all versions that ship models/qwen3_vl).
The model loads and get_peft_model works fine (Qwen3-VL-8B 4-bit + vision LoRA loads in ~34 GB on a 64 GB M4 Max) — only the training-step data collation is broken.
Minimal reproducible example
from unsloth import FastMLXModel, MLXTrainer, MLXTrainingConfig
from PIL import Image

model, processor = FastMLXModel.from_pretrained(
    "unsloth/Qwen3-VL-8B-Instruct",
    load_in_4bit=True,
    text_only=False,
    max_seq_length=4096,
)
model = FastMLXModel.get_peft_model(
    model, r=8, finetune_vision_layers=True, use_gradient_checkpointing=True
)

img = Image.new("RGB", (988, 2560))  # any image
ds = [{"messages": [
    {"role": "user", "content": [
        {"type": "image", "image": img},
        {"type": "text", "text": "Extract the fields."}]},
    {"role": "assistant", "content": [{"type": "text", "text": "{}"}]}]}]

trainer = MLXTrainer(
    model=model,
    processor=processor,
    tokenizer=getattr(processor, "tokenizer", processor),
    train_dataset=ds,
    args=MLXTrainingConfig(
        max_steps=1, output_dir="/tmp/o", per_device_train_batch_size=1),
)
trainer.train()  # -> ValueError: too many values to unpack (expected 3)
Environment
- macOS, Apple Silicon (M4 Max, 64 GB)
- unsloth 2026.6.1, unsloth_zoo 2026.6.1
- mlx 0.31.2, mlx-lm 0.31.3, mlx-vlm 0.6.2 (also 0.6.1 / 0.6.0 / 0.5.0)
- transformers 5.5.0, Python 3.11
Possibly related
Whatever produces the collated image array for the VLM MLX path needs to (a) drop the batch dim / pass per-image, (b) match mlx_vlm's channels-first (C, H, W) expectation for Qwen3-VL, and (c) expose an image-resolution knob (currently pinned at 512px).
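Points (a) and (b) amount to a squeeze followed by a transpose. The sketch below illustrates the layout change with NumPy only. `to_channels_first` is a hypothetical helper, not an existing unsloth_zoo or mlx_vlm function, and it assumes the processor accepts a NumPy-style array. Whether an `mx.array`, a different dtype, or extra normalization is required has not been verified here.

```python
import numpy as np


def to_channels_first(image):
    """Hypothetical helper: (1, H, W, C) or (H, W, C) -> (C, H, W)."""
    arr = np.asarray(image)
    if arr.ndim == 4:
        # (a) Drop the batch dimension; larger batches must go per-image.
        if arr.shape[0] != 1:
            raise ValueError(
                f"expected a single image, got batch of {arr.shape[0]}")
        arr = arr[0]
    if arr.ndim != 3:
        raise ValueError(
            f"expected a 3-D or 4-D image array, got shape {arr.shape}")
    # (b) Move channels from the last axis to the first.
    return np.transpose(arr, (2, 0, 1))


# Same shape as the array seen in the trace.
batched = np.zeros((1, 512, 512, 3), dtype=np.float32)
single = to_channels_first(batched)

# The unpacking done in _process_one now succeeds.
C, H, W = single.shape
print(C, H, W)
```

For a batch with more than one image, the same helper could be applied to each `batch[i]` in turn so the processor always receives one 3-D image at a time.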