Title. I was messing around with Qwen3.6 35B A3B Q4_K_XL on my RTX 5070, and I got around 50 tok/s.
I then realized I could be leveraging NVFP4 on my Blackwell GPU, so I tried it, but it barely reached 14 tok/s. The model doesn't fit in VRAM, so I had to offload some layers to the CPU.
I am guessing NVFP4 is only fast when the model fits entirely on the GPU? If so, I'll have to wait for a decent model that fits in 12GB VRAM 😅
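To sanity-check that guess, here's a rough back-of-the-envelope sketch. It assumes decoding is purely memory-bandwidth-bound, so time per token is just the bytes of active weights read from each device divided by that device's bandwidth. The function name and every number in it are placeholders I made up for illustration, not measurements. It also ignores kernel efficiency, dequantization cost, and PCIe transfers, so treat it as an optimistic upper bound at best.

```python
def decode_tok_per_s(active_bytes_per_token, gpu_fraction, gpu_bw, cpu_bw):
    """Rough upper bound on tokens/s, assuming decode is limited only by
    how fast the active weights can be read from memory.

    active_bytes_per_token: bytes of weights touched per generated token
    gpu_fraction: share of those bytes that live in VRAM (0.0 to 1.0)
    gpu_bw, cpu_bw: memory bandwidth in bytes/s for each device
    """
    gpu_time = active_bytes_per_token * gpu_fraction / gpu_bw
    cpu_time = active_bytes_per_token * (1.0 - gpu_fraction) / cpu_bw
    return 1.0 / (gpu_time + cpu_time)


if __name__ == "__main__":
    # Hypothetical placeholders: ~3B active params at ~4 bits each,
    # a GPU at 600 GB/s and system RAM at 60 GB/s.
    active_bytes = 1.5e9
    gpu_bw = 600e9
    cpu_bw = 60e9
    for frac in (1.0, 0.75, 0.5):
        rate = decode_tok_per_s(active_bytes, frac, gpu_bw, cpu_bw)
        print(f"{frac:.0%} of active weights in VRAM: ~{rate:.0f} tok/s upper bound")
```

With these made-up numbers, the CPU-resident share dominates the time per token as soon as it is non-trivial, no matter what format the GPU side uses. That would fit the idea that NVFP4's advantage mostly shows up when everything stays on the GPU, though it doesn't rule out a setup problem on my end.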
LMK if you've had a similar experience, or if I screwed something up.