TL;DR: New llama.cpp fork! I wanted a Windows-friendly inference stack to run Qwen 3.6 27B at Q5 on a single RTX 3090 with speculative decoding, high context without excessive quantization, and vision enabled. Nothing did this out of the box for me without VRAM and/or tooling issues (this was before the MTP PR surfaced in llama.cpp).
So I pulled out an old trick: staying up until 4 a.m. one too many times to cram a month-plus of work into a week or two. I probably lost a decent amount of hair making it all work, but now I have what seems to be a proper solution and don't mind sharing it.
Anbeeld's BeeLlama.cpp
GitHub repo: https://github.com/Anbeeld/beellama.cpp
BeeLlama.cpp (or just Bee) is a performance-focused llama.cpp fork for squeezing more speed and context out of local GGUF inference. It keeps the familiar llama.cpp tools and server flow, then adds DFlash speculative decoding, adaptive draft control, TurboQuant/TCQ KV-cache compression, and reasoning-loop protection, with full multimodal support.
Not quite a pegasus, but close enough.
Here's a plug-and-play Qwen 3.6 27B setup with a config to run it in Q5 + 200k of practically lossless KV cache + vision on a single RTX 3090 or 4090.
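For a rough idea of what the launch looks like, here's a hedged sketch rather than the exact shipped config: it assumes the fork keeps the upstream llama-server entry point, and the file names, context size, and cache types are placeholders I picked for illustration.

```
# Sketch of a BeeLlama server launch for a single-GPU setup like the one above.
# File names and values are placeholder assumptions; the real plug-and-play
# config ships with the repo.
./llama-server \
  -m Qwen3.6-27B-Q5_K_M.gguf \
  -md qwen3.6-dflash-draft.gguf --spec-type dflash \
  --mmproj qwen3.6-mmproj.gguf \
  --cache-type-k turbo3_tcq --cache-type-v turbo3_tcq \
  -c 200000 -ngl 99
```

turbo3_tcq here is just one of the five cache types listed below; the lower-bit ones trade a bit of fidelity for even more context.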
Fork Features
- DFlash speculative decoding: --spec-type dflash drives a DFlash draft GGUF alongside the target model. The target captures hidden states into a per-layer 4096-slot ring buffer; the drafter cross-attends to the most recent --spec-dflash-cross-ctx hidden-state tokens and proposes drafts for target verification.
- TurboQuant / TCQ KV-cache compression: Five cache types (turbo2, turbo3, turbo4, turbo2_tcq, turbo3_tcq) spanning 4x to 7.5x compression, with the higher-bit options practically lossless in many cases. Set independently with --cache-type-k and --cache-type-v.
- Adaptive draft-max control: The server adjusts the active draft horizon at runtime instead of using a fixed --spec-draft-n-max. The default profit controller compares speculative throughput against a no-spec baseline; the fringe alternative maps acceptance-rate bands to draft depth.
- Full multimodal support: When --mmproj is active, the server keeps flat DFlash available for text generation. The model can be fully offloaded to CPU without problems to reduce VRAM pressure.
- Reasoning-loop protection: The server detects repeated hidden reasoning output and intervenes. The default mode is force-close, with --reasoning-loop-window and --reasoning-loop-max-period available for tuning.
- Sampled DFlash verification: --spec-draft-temp enables rejection-sampling drafter behavior. It activates when both draft and target temperature exceed zero; draft log probabilities must be available for rejection sampling to produce correct output.
- DDTree branch verification: Optional --spec-branch-budget adds branch nodes beyond the main draft path with GPU parent_ids, tree masks, and recurrent tree kernels. Disabled automatically when the target model spans more than one GPU. This one is very much a work in progress!
- Request-level speculative overrides: Draft-max and branch budget can be overridden per request through JSON fields without restarting the server (see the curl sketch after this list).
- CopySpec model-free speculation: --spec-type copyspec provides rolling-hash suffix matching over previous tokens without a draft model (see the one-liner after this list).
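To give a feel for the flags, a minimal sketch of the model-free route: CopySpec needs no draft GGUF at all, so (again assuming the upstream llama-server entry point, with placeholder file and context values) a launch can be as small as:

```
# Hedged sketch: CopySpec speculation uses no draft model, only the target GGUF.
./llama-server -m model.gguf --spec-type copyspec -c 32768 -ngl 99
```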
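And a sketch of the request-level overrides against the server's /completion endpoint. The field names spec_draft_n_max and spec_branch_budget are hypothetical placeholders of mine; the actual JSON fields are documented in docs/beellama-args.md.

```
# Hedged sketch: per-request speculative overrides via JSON fields.
# "spec_draft_n_max" and "spec_branch_budget" are hypothetical names;
# see docs/beellama-args.md for the real ones.
curl http://localhost:8080/completion -H "Content-Type: application/json" -d '{
  "prompt": "Write a haiku about bees.",
  "n_predict": 64,
  "spec_draft_n_max": 12,
  "spec_branch_budget": 4
}'
```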
For the full feature and public-repo comparison, read docs/beellama-features.md. For the complete argument reference, read docs/beellama-args.md.
TurboQuant (WHT-based scalar quantization) originates from TheTom/llama-cpp-turboquant. TCQ (Trellis-Coded Quantization) and the basic DFlash implementation originate from spiritbuun/buun-llama-cpp (paper: Closing the Gap: Trellis-Coded Quantization for KV Cache at 2-3 Bits).