Possible NVFP4 Loading Issue with Qwen3.6-35B-A3B-NVFP4 #6224

@YuDsheng

Qwen3.6-35B-A3B-NVFP4 Vision model loads with MISSING MoE weights and generates corrupted multilingual output

1. Did you update? pip install --upgrade unsloth unsloth_zoo

Yes.

Studio version:

v0.1.45-beta

Package version:

2026.6.2

2. Colab or Kaggle or local / cloud

Local Linux server.


3. Number GPUs used, use nvidia-smi

1 GPU

Hardware detected:

NVIDIA GB10
Max memory: 121.69 GB

During generation:

GPU Memory Usage: ~65-80GB
GPU Utilization: ~90-95%

4. Which notebook? Please link!

Unsloth Studio local deployment.

Not using a public notebook.


5. Which Unsloth version, TRL version, transformers version, PyTorch version?

From logs:

Unsloth 2026.6.2
Transformers 5.5.0
Torch 2.10.0+cu130
CUDA Toolkit 13.0
Triton 3.6.0

Additional package versions if needed:

pip show unsloth
pip show unsloth_zoo
pip show transformers
pip show trl
pip show compressed-tensors
python -c "import torch; print(torch.__version__)"
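
The same versions can be gathered in one pass from Python. This is a minimal sketch using only the standard library; the package list mirrors the commands above, and the `collect_versions` helper name is my own, not part of Unsloth.

```python
from importlib import metadata

# Distribution names taken from the pip commands above, plus torch.
PACKAGES = ["unsloth", "unsloth_zoo", "transformers", "trl", "compressed-tensors", "torch"]


def collect_versions(packages=PACKAGES):
    """Return {package: installed version, or 'not installed'}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


if __name__ == "__main__":
    for name, version in collect_versions().items():
        print(f"{name}=={version}")
```

Running it inside the same environment that Studio uses should give output that can be pasted directly into the issue.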

6. Which trainer? SFTTrainer, GRPOTrainer etc

No training.

Inference only.


Model

model_name = "unsloth/Qwen3.6-35B-A3B-NVFP4"

Environment Notes

During startup I see:

Your Flash Attention 2 installation seems to be broken.
Using Xformers instead.

and

The fast path is not available because one of the required library is not installed.
Falling back to torch implementation.

I understand this may affect performance, but I do not think it explains the corrupted outputs described below.


Model Detection

The model is detected as:

model_type=qwen3_5_moe
architectures=['Qwen3_5MoeForConditionalGeneration']
is_vision=True

Therefore Studio correctly identifies it as a Vision model.


Loading Report

After loading weights successfully, I get the following report:

Qwen3_5MoeForConditionalGeneration LOAD REPORT from: unsloth/Qwen3.6-35B-A3B-NVFP4

Key                                                                      | Status
-------------------------------------------------------------------------+-----------
model.layers.{0...39}.mlp.experts.{0...255}.down_proj.input_global_scale | UNEXPECTED
model.layers.{0...39}.mlp.experts.{0...255}.up_proj.input_global_scale   | UNEXPECTED
model.layers.{0...39}.mlp.experts.{0...255}.gate_proj.input_global_scale | UNEXPECTED
model.layers.{0...39}.mlp.experts.down_proj_scale                        | UNEXPECTED
model.layers.{0...39}.mlp.experts.down_proj_global_scale                 | UNEXPECTED
model.layers.{0...39}.mlp.experts.down_proj_packed                       | UNEXPECTED
model.layers.{0...39}.mlp.experts.gate_up_proj_packed                    | UNEXPECTED
model.layers.{0...39}.mlp.experts.gate_up_proj_scale                     | UNEXPECTED
model.layers.{0...39}.mlp.experts.gate_up_proj_global_scale              | UNEXPECTED

model.language_model.layers.{0...39}.mlp.experts.gate_up_proj            | MISSING
model.language_model.layers.{0...39}.mlp.experts.down_proj               | MISSING

The loader also reports:

MISSING: those params were newly initialized because missing from the checkpoint.
Consider training on your downstream task.
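
To make the pattern in this report easier to see, here is a small sketch that pairs each MISSING key with the UNEXPECTED keys that look like its quantized counterparts. It is plain string matching on the key patterns copied from the report; the `normalize` and `pair_up` helpers are hypothetical names of mine, and the per-expert `input_global_scale` entries are left out for brevity. It only shows that the names line up once the `language_model.` prefix and the NVFP4 suffixes are removed. It does not prove that this is the cause.

```python
# Fused-expert key patterns copied from the load report.
UNEXPECTED = [
    "model.layers.{0...39}.mlp.experts.down_proj_scale",
    "model.layers.{0...39}.mlp.experts.down_proj_global_scale",
    "model.layers.{0...39}.mlp.experts.down_proj_packed",
    "model.layers.{0...39}.mlp.experts.gate_up_proj_packed",
    "model.layers.{0...39}.mlp.experts.gate_up_proj_scale",
    "model.layers.{0...39}.mlp.experts.gate_up_proj_global_scale",
]
MISSING = [
    "model.language_model.layers.{0...39}.mlp.experts.gate_up_proj",
    "model.language_model.layers.{0...39}.mlp.experts.down_proj",
]

# "_global_scale" must be checked before "_scale", which is also its suffix.
QUANT_SUFFIXES = ("_packed", "_global_scale", "_scale")


def normalize(key):
    """Drop the language_model prefix and any quantization suffix."""
    key = key.replace("model.language_model.", "model.", 1)
    for suffix in QUANT_SUFFIXES:
        if key.endswith(suffix):
            return key[: -len(suffix)]
    return key


def pair_up(missing, unexpected):
    """Map each missing key to the unexpected keys with the same normalized name."""
    pairs = {name: [] for name in missing}
    for candidate in unexpected:
        for name in missing:
            if normalize(candidate) == normalize(name):
                pairs[name].append(candidate)
    return pairs


if __name__ == "__main__":
    for name, matches in pair_up(MISSING, UNEXPECTED).items():
        print(name)
        for match in matches:
            print("   <-", match)
```

Each MISSING key ends up with three UNEXPECTED counterparts (`_packed`, `_scale`, `_global_scale`), which is why I suspect a key-mapping problem rather than an incomplete checkpoint.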

Observed Behavior

The loader reports success:

Successfully loaded model: unsloth/Qwen3.6-35B-A3B-NVFP4

and memory usage appears reasonable:

GPU Memory [After loading]
64.53GB / 121.69GB

However, generation behavior is abnormal.


Reproduction

Prompt:

你是什么模型? (English: "What model are you?")

No images attached.

Generation starts normally:

Starting text generation

and eventually finishes:

Finished text generation

Generation duration:

324.47 seconds

Actual Output

The model produces corrupted multilingual text consisting of random fragments from multiple languages:

对她只开始...
tail突然还是个...
stro Collaboration...
organis橱窗...
总书记不愿...
double bouncing...
[blocked]

The output is completely unrelated to the prompt.

The response contains:

  • Chinese fragments
  • English fragments
  • Arabic fragments
  • Japanese fragments
  • Random tokens
  • Broken words
  • Repeated token patterns

The output resembles what a corrupted decode or a partially initialized model would produce, not a normal language model response.


Why I Suspect a Loading Issue

The model:

  • Loads successfully
  • Consumes expected GPU memory
  • Completes generation

However:

  • Generation takes over 5 minutes for a trivial prompt
  • Output is nonsensical
  • Output is unrelated to the prompt
  • Load report shows many MoE expert weights as MISSING

The suspicious part is:

UNEXPECTED:
*_packed
*_scale
*_global_scale

MISSING:
gate_up_proj
down_proj

The MISSING entries appear to be the core MoE expert projection layers.

This makes me wonder whether:

  1. NVFP4 packed expert weights are not being correctly mapped.

  2. Some expert layers are being randomly initialized.

  3. There is a compatibility issue between:

    • Qwen3.6-35B-A3B-NVFP4
    • Transformers 5.5.0
    • Unsloth 2026.6.2 (Studio v0.1.45-beta)
  4. The checkpoint format is not being fully restored during loading.
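
One way to narrow these down would be to look at the tensor names the checkpoint actually ships, independent of any loader. The sketch below assumes the checkpoint is sharded and has the usual Hugging Face `model.safetensors.index.json` file with a `weight_map` entry; I have not confirmed that this repository is laid out that way. The `summarize_expert_keys` helper is a hypothetical name.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_expert_keys(index_path):
    """Count expert tensors in a sharded-checkpoint index by final name component.

    Assumes the index JSON has a top-level "weight_map" of tensor name -> shard file.
    """
    weight_map = json.loads(Path(index_path).read_text())["weight_map"]
    counts = Counter()
    for key in weight_map:
        if ".mlp.experts." in key:
            # e.g. "...mlp.experts.down_proj_packed" -> "down_proj_packed"
            counts[key.rsplit(".", 1)[-1]] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python summarize_keys.py /path/to/model.safetensors.index.json
    if len(sys.argv) > 1:
        for name, count in sorted(summarize_expert_keys(sys.argv[1]).items()):
            print(f"{count:8d}  {name}")
```

If the index only contains `*_packed`, `*_scale`, and `*_global_scale` names and no plain `gate_up_proj` / `down_proj`, that would point toward hypothesis 1 (the loader not mapping the packed tensors) rather than an incomplete checkpoint.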


Questions

  1. Are these UNEXPECTED and MISSING entries expected for this checkpoint?

  2. Should the following be present after loading?

gate_up_proj
down_proj

  3. Are NVFP4 packed tensors:

*_packed
*_scale
*_global_scale

supposed to be automatically converted into standard expert projection layers?

  4. Could these missing expert projections explain the corrupted multilingual output?

  5. Is there a known issue with:

unsloth/Qwen3.6-35B-A3B-NVFP4

under:

Unsloth Studio v0.1.45-beta
Unsloth 2026.6.2
Transformers 5.5.0

  6. Is this load report expected, or does it indicate an architecture / checkpoint compatibility problem?

Thanks for your help.
