ggml-cuda: add flash-attn support for DKQ=320/DV=256 with ncols2=32 (… by lnigam · Pull Request #22286 · ggml-org/llama.cpp

Improves the speed of Mistral Small 4 on CUDA (previously this head-size combination fell back to the CPU).

(I wonder if it’s somehow related to the upcoming Mistral model? Maybe not)

submitted by /u/jacek2023
