Grafting vision onto text models for fun and profit.

As we know, llama.cpp keeps the vision (or other multimedia) projector separate from the main weights. Conversely, capabilities a model was trained with might be removed at release.

What if there was a way to put them back?

Mistral has now released both Pixtral and Medium vision encoders. The tokenizers of past models contain the relevant parts:

    "10": {
      "content": "[IMG]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
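If you want to check a model's tokenizer for these leftovers before grafting anything, a rough Python sketch like the one below could do it. It assumes the tokens live under `added_tokens_decoder` in a Hugging Face style `tokenizer_config.json`; the sample data and the `find_image_tokens` helper are made up for illustration, and only the "10" entry comes from the snippet above.

```python
import json

# Sample shaped like the `added_tokens_decoder` section of a Hugging Face
# tokenizer_config.json. Entry "10" is the one quoted in the post; entry "1"
# is only an illustrative non-image contrast.
SAMPLE_CONFIG = json.loads("""
{
  "added_tokens_decoder": {
    "1":  {"content": "<s>",   "special": true},
    "10": {"content": "[IMG]", "special": true}
  }
}
""")


def find_image_tokens(config, marker="IMG"):
    """Return {token_id: content} for special tokens whose text contains `marker`."""
    found = {}
    for token_id, entry in config.get("added_tokens_decoder", {}).items():
        if entry.get("special") and marker in entry.get("content", ""):
            found[int(token_id)] = entry["content"]
    return found


print(find_image_tokens(SAMPLE_CONFIG))
```

To run it against a real model you would load the file with `json.load` instead of the inline sample; the exact file name and layout may differ between releases.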

Let's take Behemoth-X because I rather like that model.

    --mmproj Pixtral-Large-Instruct-2411-hf.mmproj-f16.gguf \
    --no-mmproj-offload \

It clearly sees images, but something is broken.

https://i.ibb.co/3mTZX7Nr/bad-image.png https://i.ibb.co/V0qvvjvm/bad-image2.png

The log tells you:

[/INST]y'know what??? shut up</s>[INST][IMG_END][/INST]
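A quick way to spot the odd one out in a line like that is to pull out every bracketed control token and compare it against the ones the text model is expected to handle. This is only a sketch: the `KNOWN_TOKENS` set is my assumption about what the fine-tune saw in training, not something read from the model, and `unexpected_tokens` is a hypothetical helper.

```python
import re

# Control tokens the text-only fine-tune is assumed to handle (an assumption,
# not something read from the model).
KNOWN_TOKENS = {"[INST]", "[/INST]"}

# Matches bracketed, upper-case control tokens such as [INST] or [IMG_END].
CONTROL_TOKEN = re.compile(r"\[/?[A-Z_]+\]")


def unexpected_tokens(rendered_prompt, known=KNOWN_TOKENS):
    """Return control tokens not in `known`, in order of appearance, without repeats."""
    seen = []
    for token in CONTROL_TOKEN.findall(rendered_prompt):
        if token not in known and token not in seen:
            seen.append(token)
    return seen


log_line = "[/INST]y'know what??? shut up</s>[INST][IMG_END][/INST]"
print(unexpected_tokens(log_line))
```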

Guess it wasn't trained on [IMG_END]. That's most unfortunate. But we have the source code and can edit mtmd.cpp:

    } else if (proj == PROJECTOR_TYPE_PIXTRAL) {
        // https://github.com/huggingface/transformers/blob/1cd110c6cb6a6237614130c470e9a902dbc1a4bd/docs/source/en/model_doc/pixtral.md
        //img_end = "[IMG_END]";
        img_end = "\n";

Alternatively, the model can be reconverted so the offending token maps to a different ID. Either way, it doesn't lose its turn anymore.

https://i.ibb.co/P7x6z31/good-image2.png https://i.ibb.co/Pn29ML2/good-image.png

Is it perfect? No. Might it work better for devstral2 or some other model you want vision for? It's highly likely.

31B Gemma contains the ASR parts in the tokenizer...

    "audio_token": "<|audio|>",
    "backend": "tokenizers",
    "boa_token": "<|audio>",
    "boi_token": "<|image>",
    "bos_token": "<bos>",
    "eoa_token": "<audio|>",
    "eoc_token": "<channel|>",
    "eoi_token": "<image|>",
    "eos_token": "<eos>",
    "eot_token": "<turn|>",
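To pick the audio-related entries out of a config like that, something along these lines might work. It is a sketch over the fragment quoted above (wrapped in braces, trailing comma dropped so it parses as JSON); `tokens_mentioning` is a hypothetical helper, and a real config may keep these tokens elsewhere.

```python
import json

# The fragment quoted in the post, wrapped in braces so it parses as JSON.
GEMMA_FRAGMENT = json.loads("""
{
  "audio_token": "<|audio|>",
  "backend": "tokenizers",
  "boa_token": "<|audio>",
  "boi_token": "<|image>",
  "bos_token": "<bos>",
  "eoa_token": "<audio|>",
  "eoc_token": "<channel|>",
  "eoi_token": "<image|>",
  "eos_token": "<eos>",
  "eot_token": "<turn|>"
}
""")


def tokens_mentioning(config, word):
    """Return the `*_token` entries whose token text contains `word`."""
    return {
        key: value
        for key, value in config.items()
        if key.endswith("_token") and word in value
    }


print(tokens_mentioning(GEMMA_FRAGMENT, "audio"))
```

Swapping "audio" for "image" lists the image begin/end pair the same way.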

submitted by /u/a_beautiful_rhind