FP4 inference in llama.cpp (NVFP4) and ik_llama.cpp (MXFP4) landed

Both llama.cpp and ik_llama.cpp now have FP4 support — but with different flavors worth knowing about.

llama.cpp recently merged NVFP4 (Nvidia's block-scaled FP4, `GGML_TYPE_NVFP4 = 40`), with CUDA kernels landing in `mmq.cuh`, `mmvq.cu`, `convert.cu` and others.

ik_llama.cpp has had MXFP4 (`GGML_TYPE_MXFP4 = 39`) since PR #682. This is the MX-standard FP4 used in gpt-oss models. Coverage is actually broader there: CPU (AVX2, NEON, Zen4) and CUDA are both implemented.

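They're not the same wire format. Both store elements as 4-bit E2M1 floats, but the block scaling differs: NVFP4 is Nvidia's variant, with blocks of 16 elements and an FP8 (E4M3) scale per block, usually paired with a per-tensor FP32 scale, while MXFP4 follows the OCP Microscaling (MX) specification, with blocks of 32 elements sharing one power-of-two (E8M0) scale. Both land in the 4-bit float regime and should bring meaningful VRAM savings once model support catches up.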
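To make the difference concrete, here is a minimal sketch of how one block dequantizes in each format. It follows the number formats as publicly described (OCP MX spec, Nvidia's NVFP4 description), not the actual ggml block structs or nibble packing in either repo. All function names are made up for illustration.

```python
# Sketch only: number formats, not the ggml memory layout (that is an assumption-free zone
# here; packing order and struct fields in llama.cpp / ik_llama.cpp are not modeled).

# E2M1: 1 sign bit, 2 exponent bits, 1 mantissa bit -> 8 magnitudes.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def decode_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 code (0..15) to a float."""
    sign = -1.0 if code & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[code & 0x7]


def decode_e8m0(byte: int) -> float:
    """MX block scale: unsigned power of two with bias 127 (0xFF is NaN)."""
    if byte == 0xFF:
        return float("nan")
    return 2.0 ** (byte - 127)


def decode_e4m3(byte: int) -> float:
    """FP8 E4M3 (no infinities, bias 7), used as the NVFP4 block scale."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0xF and man == 0x7:
        return float("nan")
    if exp == 0:  # subnormal
        return sign * (man / 8.0) * 2.0 ** -6
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)


def dequant_mxfp4_block(codes, scale_byte):
    """32 E2M1 codes sharing one E8M0 scale."""
    assert len(codes) == 32
    scale = decode_e8m0(scale_byte)
    return [decode_e2m1(c) * scale for c in codes]


def dequant_nvfp4_block(codes, scale_byte, tensor_scale=1.0):
    """16 E2M1 codes sharing one E4M3 scale, times a per-tensor FP32 scale."""
    assert len(codes) == 16
    scale = decode_e4m3(scale_byte) * tensor_scale
    return [decode_e2m1(c) * scale for c in codes]


def bits_per_weight(block_size: int, scale_bits: int = 8, elem_bits: int = 4) -> float:
    """Effective storage cost per weight, ignoring any per-tensor scale."""
    return elem_bits + scale_bits / block_size
```

With these definitions, `bits_per_weight(32)` gives 4.25 for MXFP4 and `bits_per_weight(16)` gives 4.5 for NVFP4: the smaller block and finer-grained scale cost NVFP4 a quarter bit per weight.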

Verified by grepping both repos locally today.

My specs: 5090 (24GB VRAM)
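For a rough sense of what fits in that much VRAM, here is a back-of-the-envelope estimate of weight storage alone. It assumes every weight is stored at the given bits per weight (real checkpoints usually keep some tensors at higher precision) and ignores KV cache and activations, so treat the numbers as a lower bound. The helper name is hypothetical.

```python
def weight_footprint_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB, assuming uniform quantization."""
    return n_params * bits_per_weight / 8 / 2**30


# Example: a 27B-parameter model at ~4.5 bits/weight (FP4 plus an 8-bit scale
# per 16-element block) versus 16-bit weights.
fp4_gib = weight_footprint_gib(27e9, 4.5)
fp16_gib = weight_footprint_gib(27e9, 16)
print(f"FP4-ish: {fp4_gib:.1f} GiB, FP16: {fp16_gib:.1f} GiB")
```

By this estimate a 27B model's weights drop from roughly 50 GiB at 16 bits to roughly 14 GiB, which leaves some headroom for context on a 24GB card.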

Go grab and play with models:
https://huggingface.co/models?num_parameters=min:0,max:64B&sort=modified&search=NVFP4

Personal favorites:
- Abiray-Qwen3.6-27B-NVFP4
- Qwen3-1.7B-NVFP4A16
- Qwen3.5-2B-NVFP4
- gemma-4-31B-it-NVFP4-turbo-GGUF
- Qwen3-0.6B-FP4

Exciting times for quantization.

correction: removed "Meta's"

submitted by /u/Usual-Carrot6352