Hey fellow Llamas, thank you for all the nice words and great feedback on my last post. We have something new that we thought would be useful to share. As always, your time is precious, so I'll keep it short.

We built speculative prefill for long-context decode on quantized 27B targets, C++/CUDA only. A small drafter loaded in-process scores token importance over the full prompt; the heavy target only prefills the spans that matter.

Repo: github.com/Luce-Org/lucebox-hub (open source, MIT).

Head-to-head on Qwen3.6-27B Q4_K_M, RTX 3090, single-shot: 24.8 s TTFT vs ~257 s for vanilla llama.cpp = ~10.4× at 128K (and 13.5 s vs 134.95 s = 10.0× at 64K), with NIAH retrieval preserved end-to-end. No Python, no Triton, no PyTorch in the inference loop.

The problem

Q4_K_M Qwen3.6-27B on a 24 GB 3090 decodes fast (~74 tok/s with DFlash spec decode), but prefill attention scales O(S²) in the prompt length S. On a 131K-token prompt, vanilla llama.cpp takes 248.4 s cold (llama-bench pp131072 --no-warmup -r 1, 527.6 tok/s). That is 4.1 minutes of staring at a blank screen before the first token. Decode is fast, but the wait kills the UX. Warmed steady-state is better (169.3 s at 128K) but still painful, and it grows quadratically as you push context.

Standing on shoulders

This work stands on two recent papers, both excellent reads: one on speculative prefill and one on FlashPrefill.
Our contribution is the C++/CUDA composition of these two algorithms, in-process, on a 24 GB consumer card. As far as we are aware, the two papers had not been combined in an open implementation before.

What we built
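For readers who want the core idea in a few lines, here is a rough sketch of the selection step described at the top of the post: the drafter produces a per-token importance score, and only the highest-scoring spans go to the target. It is Python purely for readability. The real thing is C++/CUDA, and the chunk size, the mean pooling, and the name `select_spans` are assumptions made for illustration, not the repo's API.

```python
import math

def select_spans(scores, keep_ratio=0.05, chunk=32):
    """Return sorted token indices to keep, chosen chunk-wise by mean score.

    `scores` is one importance value per prompt token (hypothetical drafter
    output). `chunk` is an assumed span granularity, not a value from the repo.
    """
    n = len(scores)
    if n == 0:
        return []
    n_chunks = math.ceil(n / chunk)
    # Mean importance per chunk (the last chunk may be shorter).
    means = []
    for c in range(n_chunks):
        seg = scores[c * chunk:(c + 1) * chunk]
        means.append(sum(seg) / len(seg))
    # Keep the top fraction of chunks, but never fewer than one.
    n_keep = max(1, round(keep_ratio * n_chunks))
    top = sorted(range(n_chunks), key=lambda c: means[c], reverse=True)[:n_keep]
    # Restore original order so the target sees the surviving text in sequence.
    kept = []
    for c in sorted(top):
        kept.extend(range(c * chunk, min((c + 1) * chunk, n)))
    return kept
```

With a 131072-token prompt and keep_ratio=0.05, this keeps 205 chunks of 32 tokens, i.e. 6560 tokens, which is in line with the ~6.5K survivors quoted later in the post.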
Setup

Numbers

Single-shot on RTX 3090, Qwen3.6-27B Q4_K_M target, q4_0 KV, DFLASH_FP_USE_BSA=1, DFLASH_FP_ALPHA=0.85, keep_ratio=0.05. NIAH single-needle is the end-to-end retrieval check. The baseline is vanilla llama.cpp with default f16 KV (apples-to-oranges on KV; q4_0 KV costs ~3% AL at short context, 8.56 to 8.33, benchmarked).
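A quick arithmetic check on the headline figures quoted in this post, as a throwaway Python sketch. It is not part of the repo and the function names are made up for illustration; it only re-derives the ratios from numbers already stated above.

```python
def speedup(baseline_s: float, ours_s: float) -> float:
    """How many times faster `ours_s` is than `baseline_s`."""
    return baseline_s / ours_s

def prefill_seconds(prompt_tokens: int, tokens_per_s: float) -> float:
    """Prefill wall time implied by a llama-bench throughput figure."""
    return prompt_tokens / tokens_per_s

# (baseline TTFT, our TTFT) in seconds, copied from the post.
HEADLINE = {"64K": (134.95, 13.5), "128K": (257.0, 24.8)}

if __name__ == "__main__":
    for ctx, (base, ours) in HEADLINE.items():
        print(f"{ctx}: {speedup(base, ours):.1f}x")
    cold = prefill_seconds(131072, 527.6)
    print(f"pp131072 cold: {cold:.1f} s ({cold / 60:.1f} min)")
```

Run as a script, this gives 10.0x at 64K and 10.4x at 128K, and 131072 tokens at 527.6 tok/s works out to 248.4 s, or 4.1 minutes.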
These are cold-cache numbers (first request after process boot). Warmed-vs-warmed is a smaller multiplier because llama.cpp settles into ~169 s at 128K once caches are hot. Both numbers are real and the right one depends on your workload; if you keep an engine resident, use warmed.

Decode after prefill is the standard DFlash spec-decode path with DDTree (~74 tok/s sustained on Qwen3.6-27B Q4_K_M).

Quality

NIAH single-needle (magic-key + 7-digit answer randomly placed in filler) retrieved at every context tested from 32K through 128K, keep_ratio=0.05, DFLASH_FP_ALPHA=0.85.

Honest flag: NIAH single-needle is a structurally easy probe for an attention-based selection method like ours, since the algorithm is well-suited to finding a single high-attention span. RULER and NIAH multi-needle are next on the list; a fair audit should wait for those numbers.

Why the stack works

Speculative prefill solves a quality problem: how do you compress without losing the answer-relevant content? FlashPrefill solves a speed problem inside the drafter step: how do you make the drafter fast enough at 128K that it doesn't become the bottleneck? They compose cleanly because the target side (DFlash spec decode) is unchanged; it just receives a much shorter prompt with full attention enabled.

At 128K, drafter scoring is now the dominant cost (~12 s of the 24.8 s TTFT). Target prefill on the compressed ~6.5K survivors is ~10 s; the remaining ~3 s is the park/unpark/free dance. The next obvious lever is a smaller or distilled drafter, which we have not done yet.

Tuning

keep_ratio=0.05 is the default. 0.02 cuts target prefill from ~10 s to ~3 s but starts losing the needle. DFLASH_FP_ALPHA=0.99 cuts ~1 s at 128K with a small NIAH-margin loss. Calibration territory.

Any feedback is more than welcome!