ggml-cuda: add flash-attn support for DKQ=320/DV=256 with ncols2=32 (… by lnigam · Pull Request #22286 · ggml-org/llama.cpp

Improves the speed of Mistral Small 4 on CUDA (previously this head-size combination fell back to the CPU).

(I wonder if it’s somehow related to the upcoming Mistral model? Maybe not)

submitted by /u/jacek2023
