Experiment: Olmo 3 7B Instruct Q1_0

I tried to quantize OLMo-3 7B Instruct into a Bonsai 1-bit format. After looking into different approaches I landed on quantization-aware distillation, which seemed like the most viable path to get a usable 1-bit model.
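Roughly, quantization-aware distillation means the student runs its forward pass with quantized weights while being trained to match the teacher's output distribution. A minimal numpy sketch of the two core pieces, 1-bit (sign) weight quantization with a per-tensor scale and a KL distillation loss; the function names and the BitNet-style scaling are my assumptions, not the actual training code:

```python
import numpy as np

def quantize_1bit(w):
    # 1-bit quantization in the BitNet style: keep only the sign of each
    # weight, rescaled by the mean absolute value so magnitudes are
    # preserved on average.
    scale = np.abs(w).mean()
    return np.sign(w) * scale

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(teacher_logits, student_logits, temperature=2.0):
    # Standard distillation objective: KL(teacher || student) on
    # temperature-softened distributions.
    p = softmax(teacher_logits / temperature)
    q = softmax(student_logits / temperature)
    return float((p * (np.log(p + 1e-9) - np.log(q + 1e-9))).sum(axis=-1).mean())

# Toy forward pass: the student's matmul uses quantized weights, so the
# gradient signal (in real training, via a straight-through estimator)
# accounts for the quantization.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))           # batch of hidden states
w_teacher = rng.normal(size=(16, 32))  # teacher projection
w_student = w_teacher + 0.1 * rng.normal(size=(16, 32))

loss = kd_loss(x @ w_teacher, x @ quantize_1bit(w_student))
print(f"KD loss: {loss:.4f}")
```

In real QAT the quantized weights are used in the forward pass while gradients flow to the full-precision copy; the sketch above only shows the forward side.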

The model was trained on 4x B200 GPUs for about 12 hours. Unfortunately I had to stop way too early due to budget constraints. At this point it can produce English and some basic outputs on short sequences, but it is generally not usable: it falls into repetition loops quickly and has almost no context tracking. I believe these issues would have resolved with more training time and a better dataset; I picked the wrong one.

For the distillation I forked the DistillKit library and made some additions. It is easy to use, and the repo includes scripts to export directly to GGUF. I also ran a very short DPO step afterward; there may have been minor improvements, or maybe not, hard to tell.
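For reference, the DPO objective rewards the policy for widening the log-probability gap between the chosen and rejected responses relative to a frozen reference model. A minimal numpy sketch of the per-example loss (the sequence log-probs are assumed precomputed; `beta` is the usual KL-strength hyperparameter):

```python
import numpy as np

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    # DPO: -log(sigmoid(beta * (policy_margin - reference_margin)))
    # where each margin is log p(chosen) - log p(rejected) under that model.
    margin = (policy_chosen - policy_rejected) - (ref_chosen - ref_rejected)
    return float(-np.log(1.0 / (1.0 + np.exp(-beta * margin))))

# If the policy prefers the chosen response more strongly than the
# reference does (positive margin), the loss drops below log(2),
# which is the loss at zero margin.
print(dpo_loss(policy_chosen=-10.0, policy_rejected=-14.0,
               ref_chosen=-11.0, ref_rejected=-12.0))
```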

To run it you need the Bonsai llama.cpp fork at PrismML-Eng/Bonsai-demo, since the CUDA backend has not been upstreamed into llama.cpp yet. For the distillation code, see https://github.com/cturan/DistillKit (all written by AI, so there may be hallucinated logic and bugs). If you have questions, just ask an LLM lol.
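A typical way to build and try the fork, assuming it keeps upstream llama.cpp's CMake setup and binary names (I have not verified the Bonsai repo's exact build flags; the GitHub URL and the model path are illustrative):

```shell
# Clone the Bonsai llama.cpp fork and build with the CUDA backend enabled
git clone https://github.com/PrismML-Eng/Bonsai-demo
cd Bonsai-demo
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# Run the exported GGUF interactively (model path is a placeholder)
./build/bin/llama-cli -m /path/to/olmo3-7b-bonsai.gguf -p "Hello" -n 64
```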

submitted by /u/butlan
