I have been trying Mistral 3.5 on my 4x RTX 3090 rig with llama.cpp. Inference is slow (about 11 t/s) even though nothing is offloaded to the CPU. Here is the llama-server command I used:
    ./llama-server --model ../downloaded_models/Mistral-Medium-3.5-128B-UD-Q4_K_XL-00001-of-00003.gguf \
        --port 11433 --host 0.0.0.0 --temp 0.7 --jinja -fa on \
        --chat-template-kwargs '{"reasoning_effort":"none"}'

llama.cpp automatically picked a context window of about 44,000 tokens so that the computation fits entirely on the GPUs.
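(If it matters, I assume I could also pin the context and GPU layers explicitly instead of letting it auto-fit, something like the line below, but I haven't actually tuned these; the -c value is just a guess:)

    ./llama-server --model ../downloaded_models/Mistral-Medium-3.5-128B-UD-Q4_K_XL-00001-of-00003.gguf \
        --port 11433 --host 0.0.0.0 --temp 0.7 --jinja -fa on \
        --chat-template-kwargs '{"reasoning_effort":"none"}' \
        -c 32768 -ngl 999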
A while ago I tested Qwen 3.5 27b with vLLM and was impressed by the speed boost compared to llama.cpp (I can't remember the exact numbers, but it was something like 2-3x faster). However, the VRAM usage was much higher.
I am a complete noob when it comes to vLLM, so my questions are: is it possible to run a quantized version of a big model like Mistral 3.5 with vLLM on my current hardware with a decent context size? And is there a way to predict the speed vs. VRAM tradeoff between llama.cpp and vLLM?
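For reference, what I imagine the vLLM invocation would look like is roughly the sketch below. The model path is a placeholder (I'd guess an AWQ or GPTQ quant rather than the GGUF, since as far as I know vLLM's GGUF support is experimental), and the flag values are guesses I haven't verified:

    vllm serve <some-AWQ-or-GPTQ-quant-of-Mistral-Medium-3.5> \
        --tensor-parallel-size 4 \
        --max-model-len 32768 \
        --gpu-memory-utilization 0.90

My rough mental math is that ~128B parameters at ~4 bits per weight is roughly 64-70 GB of weights, which should fit in the 96 GB of total VRAM with some headroom for KV cache, but I have no idea how to account for vLLM's extra overhead on top of that.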