Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter

Just sharing this here; I'm not sure whether it's suitable or useful for local models.

This is from Kimi/Moonshot. Source: Tweet

We push Prefill/Decode disaggregation beyond a single cluster: cross-datacenter + heterogeneous hardware, unlocking the potential for significantly lower cost per token.

This was previously blocked by KV cache transfer overhead. The key enabler is our hybrid model (Kimi Linear), which reduces KV cache size and makes cross-DC PD practical.

Validated on a 20x scaled-up Kimi Linear model:
✅ 1.54× throughput
✅ 64% ↓ P90 TTFT
→ Directly translating into lower token cost.

More in Prefill-as-a-Service: arxiv.org/html/2604.15039v1
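
For a rough feel of why KV cache size is the bottleneck for cross-datacenter prefill/decode, here is a back-of-envelope sketch. Every number in it (context length, layer count, KV heads, head dimension, link bandwidth, and the 25% share of layers that keep a full KV cache) is a hypothetical assumption for illustration, not a figure from the tweet or the paper, and the formula is the generic per-token K/V estimate rather than Kimi Linear's actual layout.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim,
                   bytes_per_elem=2, full_attn_fraction=1.0):
    """Approximate per-request KV cache size in bytes.

    Only layers with full attention are counted as storing per-token K and V.
    Linear-attention layers are assumed to keep a small fixed-size state,
    which this sketch ignores.
    """
    full_layers = layers * full_attn_fraction
    # Factor 2 = one K tensor plus one V tensor per full-attention layer.
    return int(2 * tokens * full_layers * kv_heads * head_dim * bytes_per_elem)


def transfer_seconds(num_bytes, link_gbps):
    """Time to ship num_bytes over a link of link_gbps gigabits per second."""
    return num_bytes * 8 / (link_gbps * 1e9)


# Hypothetical model and network parameters (assumptions, not from the source).
TOKENS = 128_000      # prompt length in tokens
LAYERS = 48           # total transformer layers
KV_HEADS = 8          # KV heads per layer
HEAD_DIM = 128        # dimension per head
LINK_GBPS = 100       # usable cross-datacenter bandwidth

dense = kv_cache_bytes(TOKENS, LAYERS, KV_HEADS, HEAD_DIM)
# Assumed hybrid: only a quarter of the layers keep a full KV cache.
hybrid = kv_cache_bytes(TOKENS, LAYERS, KV_HEADS, HEAD_DIM,
                        full_attn_fraction=0.25)

for name, size in [("dense", dense), ("hybrid", hybrid)]:
    secs = transfer_seconds(size, LINK_GBPS)
    print(f"{name}: {size / 1e9:.1f} GB, {secs:.2f} s over {LINK_GBPS} Gbps")
```

Under these made-up numbers the transfer time scales directly with the share of layers that keep a full KV cache, and that time is added to TTFT whenever prefill and decode sit in different datacenters. Real deployments would also have to account for latency, compression, and overlapping transfer with compute, none of which is modeled here.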

submitted by /u/pmttyji