LocalLLaMA

Dual GPU llama.cpp speedup

llama.cpp has had a long-standing issue with "--split-mode tensor": you get great results, but it only supports non-quantized KV caches. For this very reason, a lot of people decide to go with a healthy-sized KV cache and ignore tensor p…