Moonshot open-sourced FlashKDA, CUTLASS kernels for Kimi Delta Attention, up to 2.22x over the Triton baseline on H20

github.com/MoonshotAI/FlashKDA

Been comparing how different routing layers handle K2.6 this week (OpenRouter, Together, Orq), and while digging around I came across FlashKDA, which Moonshot dropped alongside the K2.6 activity. It seems to be flying under the radar; sharing here because the kernel work is genuinely interesting on its own, separate from the model release.

What it is. A CUTLASS C++ implementation of the forward kernel for Kimi Delta Attention, the linear attention variant from the Kimi Linear paper. It plugs into flash-linear-attention as a backend through FLA pull request #852, so anyone already using FLA for KDA-based models can route through FlashKDA at the backend layer.
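I haven't traced the exact hook FLA exposes in that PR, but conceptually the backend-layer routing is a fallback pattern like the sketch below. Every name here (the `flash_kda` module, the helper functions) is hypothetical, not FLA's actual API:

```python
# Hypothetical sketch of backend-layer dispatch, NOT FLA's actual API.
# The real integration goes through FLA PR #852; this just illustrates
# the pattern: prefer the CUTLASS kernel when it is importable and the
# GPU qualifies, otherwise keep the Triton reference path.

def pick_kda_backend():
    """Return a label for the forward kernel we'd dispatch to."""
    try:
        import flash_kda  # hypothetical module name for the CUTLASS extension
        has_cutlass = True
    except ImportError:
        has_cutlass = False

    sm_major = query_sm_major()
    if has_cutlass and sm_major >= 9:  # FlashKDA requires SM90+
        return "cutlass"
    return "triton"

def query_sm_major():
    # Stub so the sketch runs without a GPU; real code would ask
    # torch.cuda.get_device_capability() for the compute capability.
    return 9

print(pick_kda_backend())
```

The nice part of landing this at the backend layer is that model code stays untouched; you only swap the kernel underneath.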

Numbers from their H20 benchmark, measured against FLA's existing Triton path:

At T=8192, H=96, D=128:

- Fixed-length sequences: 1.72x
- Variable length, mixed seq_lens: 1.95x
- Variable length, uniform 1024x8: 2.22x
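For anyone not steeped in varlen conventions: I read "uniform 1024x8" as 8 sequences of 1024 tokens packed into one T=8192 stream, described by a cumulative-offsets array rather than padding. That's my interpretation of the benchmark names (following the usual flash-attention-style `cu_seqlens` convention), not something I've confirmed against their harness:

```python
# Sketch of the packed varlen layout flash-attention-style kernels use.
# Instead of padding each sequence to the max length, sequences are
# concatenated along the token dim and the kernel gets cumulative
# offsets marking where each one starts. Naming is the common
# cu_seqlens convention, not FlashKDA's exact API.

def cu_seqlens(seq_lens):
    """Cumulative offsets: offsets[i] is where sequence i starts."""
    offsets = [0]
    for n in seq_lens:
        offsets.append(offsets[-1] + n)
    return offsets

uniform = cu_seqlens([1024] * 8)              # my reading of "uniform 1024x8"
mixed = cu_seqlens([512, 2048, 1536, 4096])   # a hypothetical "mixed" case

print(uniform)      # [0, 1024, 2048, 3072, 4096, 5120, 6144, 7168, 8192]
print(uniform[-1])  # total packed length T = 8192
```

That the uniform-packed case shows the biggest win (2.22x) would make sense if the CUTLASS kernel's tiling likes equal-sized work units, but that's speculation on my part.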

Why this matters. Linear attention architectures like KDA promise linear scaling with sequence length, but the promise only holds if the kernel implementation is actually hardware-efficient. FLA's Triton path is the reference implementation and it works, but a CUTLASS kernel tuned for Hopper memory access patterns is how you close the gap between the theoretical cost model and what you actually see on a real GPU.
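For context on where the linear-scaling claim comes from at the algorithm level: delta-rule attention (the family KDA belongs to) maintains a fixed-size DxD state per head and updates it once per token, so total cost is O(T*D^2), linear in T, versus softmax attention's O(T^2*D). A minimal numpy sketch of the plain delta rule, leaving out KDA's gating and all the chunked-parallel tricks the real kernels use:

```python
import numpy as np

def delta_rule_forward(q, k, v, beta):
    """Plain delta-rule linear attention, one head, sequential over time.

    q, k, v: (T, D) arrays; beta: (T,) write strengths in [0, 1].
    The state S is (D, D), so per-token work is O(D^2) and total work is
    O(T * D^2): linear in sequence length. (KDA adds per-channel gating
    on top of this recurrence; omitted here for clarity.)
    """
    T, D = q.shape
    S = np.zeros((D, D))
    out = np.empty((T, D))
    for t in range(T):
        # Delta update: overwrite the value stored under key k_t with v_t,
        # at rate beta_t. Equivalent to S @ (I - b k k^T) + b v k^T.
        S = S + beta[t] * np.outer(v[t] - S @ k[t], k[t])
        out[t] = S @ q[t]
    return out

rng = np.random.default_rng(0)
T, D = 16, 8
o = delta_rule_forward(rng.normal(size=(T, D)), rng.normal(size=(T, D)),
                       rng.normal(size=(T, D)), rng.uniform(0, 1, size=T))
print(o.shape)  # (16, 8)
```

The sequential loop is exactly what the real kernels must avoid; the whole game in FLA and FlashKDA is reformulating this recurrence into chunkwise matmuls that keep tensor cores busy, which is why the kernel quality matters so much.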

Requirements: SM90+, CUDA 12.9+, PyTorch 2.4+. MIT licensed.

One honest limitation worth flagging: the benchmark is forward-pass only and all numbers are on H20. H20 is the China-specific Hopper variant, so absolute numbers on H100 or Blackwell will differ; the relative speedup should be directionally similar, but nobody has posted those numbers yet.

Curious whether anyone here has tested it on H100, or has thoughts on when a backward-pass kernel might land. The forward-only story limits the training use case right now.

submitted by /u/Cosmicdev_058