Should I be seeing more of a performance leap when using NVFP4, INT4, and FP8 with vLLM than with MXFP4, Q4, and Q8 under llama.cpp-based inference on Blackwell GPUs?

I hope I am doing something wrong here, but I am seeing almost double the t/s using LM Studio with Qwen3.5 and Nemotron models compared to Nvidia's own vLLM containers built for Spark.

I was surprised I was only getting 15-ish t/s with Nemotron Nano NVFP4 in vLLM using Nvidia's recommended settings, while getting 30 t/s using Unsloth's MXFP4 of Nemotron Nano in LM Studio.
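For reference, a minimal launch along the lines of what I am running (the model path is a placeholder for whatever NVFP4 checkpoint you pull; vLLM normally auto-detects the quantization method from the checkpoint config, so no explicit `--quantization` flag should be needed):

```shell
# Placeholder model path -- substitute the actual NVFP4 checkpoint you are serving.
# Quantization is auto-detected from the checkpoint's config in recent vLLM versions.
vllm serve <your-nemotron-nano-nvfp4-checkpoint> \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90
```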

I have two RTX Pro 6000s. One is dedicated to Ollama for on-demand model switching; the other is dedicated to a single model running in vLLM.

I get 40+ t/s using Mistral Small 3 24B Q8 in Ollama and around 20-30 t/s with Qwen3.5 27B FP8 in vLLM.

Plus, models load in LM Studio 10x as fast. Seriously, vLLM takes 10-15 minutes to load a model; LM Studio and Ollama take about 90 seconds, even for the larger ones like Qwen3.5 122B and Devstral 2 123B.

One thing vLLM does have going for it is being able to take advantage of multi-token prediction, and that brings it up to par with llama.cpp-based inference.
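Multi-token prediction goes through vLLM's speculative-decoding path; a sketch of how that is enabled is below, but treat the JSON keys as assumptions, since the exact `--speculative-config` schema (and which `method` values a given model supports, e.g. MTP heads vs. n-gram lookahead) differs across vLLM versions, so check the docs for the version inside Nvidia's container:

```shell
# Sketch only: JSON keys and supported "method" values vary by vLLM version.
# Models that ship MTP heads use the same flag with a model-specific method name.
vllm serve <model> \
  --speculative-config '{"method": "ngram", "num_speculative_tokens": 4, "prompt_lookup_max": 4}'
```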

I would really like to see the performance benefit of the native 4-bit cores in the Blackwell architecture, but I am not seeing it.

Note: I am no bigger a fan of Ollama than the next guy, but when I was first building a setup for a small team it just worked; I could set it up with a couple of models and forget about it. Plus, llama.cpp, Ollama, and LM Studio let you load multiple models on a single GPU, where vLLM doesn't; they support this without additional Nvidia/Docker GPU-sharing config.
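To be fair, you can approximate multi-model serving on one GPU with vLLM by running two server processes and capping each one's VRAM slice; the fractions and ports below are illustrative, and you have to size them yourself to each model's weights plus KV cache:

```shell
# Two vLLM servers sharing one GPU; each instance only claims its configured
# fraction of VRAM. Fractions/ports are examples -- tune to your models.
vllm serve <model-a> --port 8000 --gpu-memory-utilization 0.45 &
vllm serve <model-b> --port 8001 --gpu-memory-utilization 0.45 &
```

It works, but it is exactly the kind of manual bookkeeping that Ollama and LM Studio handle for you.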

submitted by /u/aaronr_90