
Eval bug: LLaVa convert_image_encoder_to_gguf.py fails to byteswap v.head.ffn_up.bias tensor on Big-Endian system #12863

Closed
@taronaeo

Description


Name and Version

This bug is specific to the Python code, not the C/C++ code.

$ git rev-parse HEAD
fe5b78c

Operating systems

Linux

GGML backends

CPU, BLAS

Hardware

IBM z15 8 IFLs / 64 GB RAIM / 160 GB + 500 GB DASD / NOSMT / LPAR

Models

IBM Granite Vision 3.2 2B F16 (mmproj-model-f16.gguf)

Problem description & steps to reproduce

The Problem

Using the following machines for this test:

  1. MacBook Air M3 (Little-Endian byte-order)
  2. IBM z15 Mainframe (Big-Endian byte-order)
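
As a quick sanity check (not part of the original report), each machine's native byte order can be confirmed from Python before converting:

python3 -c 'import sys; print(sys.byteorder)'   # prints 'little' on the M3, 'big' on the z15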

Steps to reproduce:

  1. On both machines, pull the latest code and follow the [README-granitevision.md](https://github.com/ggml-org/llama.cpp/blob/master/examples/llava/README-granitevision.md) instructions.
  2. On both machines, create the mmproj-model-f16.gguf file using the following command:
python3 /opt/llama-testbed/examples/llava/convert_image_encoder_to_gguf.py \
  -m $ENCODER_PATH/ \
  --llava-projector $ENCODER_PATH/llava.projector \
  --output-dir $ENCODER_PATH/ \
  --clip-model-is-vision \
  --clip-model-is-siglip \
  --image-mean 0.5 0.5 0.5 \
  --image-std 0.5 0.5 0.5 \
  --bigendian
  3. Try running inference on a Big-Endian machine with each generated mmproj-model-f16.gguf. The file generated on the Little-Endian machine works on Big-Endian, but the file generated on the Big-Endian machine does not.
build/bin/llama-llava-cli -m /opt/hf_models/granite-vision-3.2-2b.F16.gguf \
  --mmproj $ENCODER_PATH/mmproj-model-f16.gguf \
  --image /opt/llama-testbed/DEMO-TAX-INVOICE-PNG.png \
  -c 16384 \
  -p "<|system|>\nA chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\n<|user|>\n\<image>\nWhat does the text in this image say?\n<|assistant|>\n" \
  --temp 0
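
As an optional check (not in the original steps), the two files can first be confirmed to differ at the byte level before inspecting them with xxd/vimdiff below; the file names match the log output, with the Little-Endian conversion copied over as mmproj-model-f16-le2be.gguf:

# Hypothetical helper, assuming both files sit in the current directory.
import hashlib

def sha256(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

print(sha256("mmproj-model-f16-le2be.gguf"))  # converted on the Little-Endian M3 (works)
print(sha256("mmproj-model-f16.gguf"))        # converted on the Big-Endian z15 (broken)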

Problem Identified

  1. Running vimdiff against both mmproj-model-f16.gguf files, I noticed that the v.head.ffn_up.bias tensor is not byte-swapped correctly when the file is generated on a Big-Endian system, but is correct when generated on a Little-Endian system.
for i in {0..36176..560}; do vimdiff <(xxd -s$i -l560 mmproj-model-f16-le2be.gguf) <(xxd -s$i -l560 mmproj-model-f16.gguf); done
  2. vimdiff shows that only v.head.ffn_up.bias is not byte-swapped correctly. (Left pane shows the correct byteswap; right pane shows the incorrect byteswap.)
(Screenshot: vimdiff output showing the byteswap difference in v.head.ffn_up.bias.)

Running gguf_dump.py also shows that v.head.ffn_up.bias is the last tensor in the model file.
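
For reference, here is a minimal numpy sketch, not the converter's actual code, of what writing an F32 tensor such as v.head.ffn_up.bias into a big-endian file should involve; if this normalization is skipped for a tensor, its bytes land in the file in the wrong order, which matches what the diff above shows:

# Minimal sketch, not the converter's actual code.
import numpy as np

bias = np.arange(1152, dtype=np.float32)  # stand-in for v.head.ffn_up.bias

# astype with an explicit byte-order prefix copies the data and swaps bytes
# only when the source order differs, so it works on both LE and BE hosts.
be_bytes = bias.astype(">f4").tobytes()   # what a big-endian GGUF file expects
host_bytes = bias.tobytes()               # host order, written unchanged

# Differs on a little-endian host; identical on a big-endian host.
print(be_bytes == host_bytes)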

First Bad Commit

NIL

Relevant log output

python3 gguf_dump.py ~/Documents/hf_models/granite-vision-3.2-2b/visual_encoder/mmproj-model-f16-le2be.gguf
INFO:gguf-dump:* Loading: /Users/taronaeo/Documents/hf_models/granite-vision-3.2-2b/visual_encoder/mmproj-model-f16-le2be.gguf
* File is BIG endian, script is running on a LITTLE endian host.
* Dumping 25 key/value pair(s)
      1: UINT32     |        1 | GGUF.version = 3
      2: UINT64     |        1 | GGUF.tensor_count = 451
      3: UINT64     |        1 | GGUF.kv_count = 22
      4: STRING     |        1 | general.architecture = 'clip'
      5: BOOL       |        1 | clip.has_text_encoder = False
      6: BOOL       |        1 | clip.has_vision_encoder = True
      7: BOOL       |        1 | clip.has_llava_projector = True
      8: UINT32     |        1 | general.file_type = 1
      9: STRING     |        1 | general.name = 'siglip-model'
     10: STRING     |        1 | general.description = 'image encoder for LLaVA'
     11: STRING     |        1 | clip.projector_type = 'mlp'
     12: UINT32     |        1 | clip.vision.image_size = 384
     13: UINT32     |        1 | clip.vision.patch_size = 14
     14: UINT32     |        1 | clip.vision.embedding_length = 1152
     15: UINT32     |        1 | clip.vision.feed_forward_length = 4304
     16: UINT32     |        1 | clip.vision.projection_dim = 0
     17: UINT32     |        1 | clip.vision.attention.head_count = 16
     18: FLOAT32    |        1 | clip.vision.attention.layer_norm_epsilon = 9.999999974752427e-07
     19: UINT32     |        1 | clip.vision.block_count = 27
     20: [INT32]    |       54 | clip.vision.image_grid_pinpoints = [384, 384, 384, 768, 384, 1152, ...]
     21: STRING     |        1 | clip.vision.mm_patch_merge_type = 'spatial_unpad'
     22: [INT32]    |        4 | clip.vision.feature_layer = [4, 8, 16, 27]
     23: [FLOAT32]  |        3 | clip.vision.image_mean = [0.5, 0.5, 0.5]
     24: [FLOAT32]  |        3 | clip.vision.image_std = [0.5, 0.5, 0.5]
     25: BOOL       |        1 | clip.use_gelu = False
* Dumping 451 tensor(s)
      1:       2048 |  2048,     1,     1,     1 | F32     | mm.0.bias
...truncated...
    451:       1152 |  1152,     1,     1,     1 | F32     | v.head.ffn_up.bias
