Slow tok/s when offloading NVFP4 model to CPU

Title. I was messing around with Qwen3.6 35B A3B Q4_K_XL on my RTX 5070, and I got around 50 tok/s.

I then realized I could be leveraging NVFP4 on my Blackwell GPU, so I tried it, but it barely reached 14 tok/s. The model doesn't fit in VRAM, so I had to offload some layers to the CPU.

I am guessing NVFP4 is only fast when the model fits entirely on the GPU? If so, I'll have to wait for a decent model that fits in 12GB VRAM 😅
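
To sanity-check that guess, here is a rough back-of-envelope sketch in Python. It assumes decoding is limited by how fast the active weights can be read for each token, and that the GPU and CPU parts run one after the other. The function name and every number in it are made-up placeholders for illustration, not measurements from my machine.

```python
def decode_tok_per_s(gpu_gb_per_token, cpu_gb_per_token, gpu_bw_gb_s, cpu_bw_gb_s):
    """Rough upper bound on decode speed with weights split across GPU and CPU.

    Assumptions (hypothetical model, not a benchmark):
    - decode is memory-bandwidth bound
    - the GPU-resident and CPU-resident parts are processed sequentially
    """
    seconds_per_token = (
        gpu_gb_per_token / gpu_bw_gb_s + cpu_gb_per_token / cpu_bw_gb_s
    )
    return 1.0 / seconds_per_token


# Placeholder numbers: 2 GB of weights read per token,
# GPU memory at 500 GB/s, system RAM at 50 GB/s.
all_on_gpu = decode_tok_per_s(2.0, 0.0, 500.0, 50.0)
half_on_cpu = decode_tok_per_s(1.0, 1.0, 500.0, 50.0)

print(f"all on GPU:  {all_on_gpu:.1f} tok/s")
print(f"half on CPU: {half_on_cpu:.1f} tok/s")
```

Under those assumptions, the CPU-side reads dominate the time per token as soon as a meaningful share of the weights lives in system RAM, so a faster GPU format would not help much with that part.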

LMK if you've had a similar experience or I screwed up something else.

submitted by /u/6c5d1129