Hey fellow Llamas, your time is precious, so I'll keep it short. We built a GGUF port of DFlash speculative decoding: a standalone C++/CUDA stack on top of ggml that runs on a single 24 GB RTX 3090 and hosts the new Qwen3.6-27B. We call it Luce DFlash (https://github.com/Luce-Org/lucebox-hub; MIT). It delivers a ~1.98x mean speedup over autoregressive decoding on Qwen3.6 across HumanEval / GSM8K / Math500, with zero retraining (z-lab published a matched Qwen3.6-DFlash draft on 2026-04-26, still under training, so AL should keep climbing).

If you have CUDA 12+ and an NVIDIA GPU (RTX 3090 / 4090 / 5090, DGX Spark, other Blackwell, or Jetson AGX Thor with CUDA 13+), all you need is:

```shell
# After cloning the repo (link in the first comment):

# Fetch target (~16 GB)

# Matched 3.6 draft is gated: accept terms + set HF_TOKEN first

# Run
```
That's it. No Python runtime in the engine, no llama.cpp install, no vLLM, no SGLang. The binary links libggml*.a and never libllama.
Benchmark setup: RTX 3090, Qwen3.6-27B UD-Q4_K_XL (unsloth Dynamic 2.0) target, 10 prompts per dataset, n_gen=256.
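For intuition on how mean acceptance length (AL) translates into end-to-end speedup, here's a back-of-envelope cost model. This is my own sketch, not code from the repo, and the block size and draft cost ratio below are made-up illustrative numbers; real runs also pay verification, sampling, and kernel-launch overheads that this ignores.

```python
def spec_speedup(accept_len, k_draft, draft_cost_ratio):
    """Simplified speculative-decoding cost model (illustrative only).

    One cycle: the draft proposes k_draft tokens, each costing
    draft_cost_ratio of a target forward pass; the target verifies
    them in a single pass (cost 1); accept_len tokens are kept on
    average. Baseline autoregressive cost is 1 pass per token.
    """
    cycle_cost = 1.0 + k_draft * draft_cost_ratio  # target pass + draft passes
    return accept_len / cycle_cost

# Hypothetical numbers: AL ~ 8.5, 16 drafted tokens, draft at 2% of target cost.
print(round(spec_speedup(8.5, 16, 0.02), 2))  # -> 6.44 (idealized upper bound)
```

The gap between an idealized figure like this and measured wall-clock speedup is exactly the overhead the model leaves out, which is why end-to-end numbers on real hardware matter.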
The speedup is real on consumer hardware, not just a paper number. In AR mode the target graph produces output bit-identical to plain autoregressive decoding, and the draft graph matches the z-lab PyTorch reference at a cosine similarity of 0.999812. Q4_0 KV cache costs ~3% AL at short context (8.56 → 8.33) and wins at long context, where F16 won't fit anyway. Constraints: CUDA only, greedy verify only (temperature/top_p on the OpenAI-compatible server are accepted and ignored), no Metal / ROCm / multi-GPU. The repo started on a single 3090; recent community PRs added support for the RTX 5090, DGX Spark / GB10, other Blackwell cards, and Jetson AGX Thor (sm_110 + CUDA 13). Feedback more than welcome!
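For anyone unfamiliar with why greedy verify guarantees bit-identical output: the target's argmax at each position is compared against the draft's proposal, tokens are accepted while they match, and on the first mismatch the target's own token is taken instead. A minimal sketch of the idea (my own illustration, not the repo's implementation):

```python
def greedy_verify(draft_tokens, target_argmax, prefix):
    """Greedy speculative verification sketch.

    Accept draft tokens while they equal the target's greedy choice;
    on the first mismatch, emit the target's token and stop. The
    result is therefore identical to plain greedy decoding.
    """
    accepted = []
    ctx = list(prefix)
    for t in draft_tokens:
        expected = target_argmax(ctx)   # target's greedy choice here
        if t == expected:
            accepted.append(t)          # draft guessed right: accept
            ctx.append(t)
        else:
            accepted.append(expected)   # mismatch: take target's token, stop
            break
    else:
        # Every draft token accepted: the verify pass yields one bonus token.
        accepted.append(target_argmax(ctx))
    return accepted

# Toy target that always continues with previous token + 1:
tgt = lambda ctx: ctx[-1] + 1
print(greedy_verify([2, 3, 9], tgt, [1]))  # -> [2, 3, 4]
```

In a real engine the per-position argmaxes come from a single batched target forward pass over the drafted block, which is where the speedup comes from.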