Special thanks for u/Sea-Speaker1700 to make possible run mxfp4 on R0700 GPU, first guide to run 122B models here
Well, 397B model works amazing, super fast.
Use this Dockerfile to build image, original image provided by u/Sea-Speaker1700
FROM tcclaviger/vllm-rocm-rdna4-mxfp4:latest # Transformers Update RUN pip install --upgrade transformers # Triton Patch RUN find /app -name "topk.py" -exec grep -l "N_EXPTS_ACT=k," {} \; | xargs -I{} sed -i 's/N_EXPTS_ACT=k, # constants/N_EXPTS_ACT=__import__("triton").next_power_of_2(k), # constants/' {} CMD ["/bin/bash"]
build patched version
docker build -t vllm-mxfp4-patched -f Dockerfile .
Download model:
git lfs clone https://huggingface.co/djdeniro/Qwen3.5-397B-A17B-MXFP4
Launch script, keep your device id, replace $1 with model name, $2 with out port.
docker run --name "$1" \ --rm --tty --ipc=host --shm-size=32g \ --device /dev/kfd:/dev/kfd \ --device /dev/dri/renderD128:/dev/dri/renderD128 \ --device /dev/dri/renderD129:/dev/dri/renderD129 \ --device /dev/dri/renderD130:/dev/dri/renderD130 \ --device /dev/dri/renderD131:/dev/dri/renderD131 \ --device /dev/dri/renderD132:/dev/dri/renderD132 \ --device /dev/dri/renderD137:/dev/dri/renderD137 \ --device /dev/dri/renderD138:/dev/dri/renderD138 \ --device /dev/dri/renderD139:/dev/dri/renderD139 \ --device /dev/mem:/dev/mem \ -e HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \ -e ROCR_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \ -v /mnt/llm_disk/models:/app/models:ro \ -e TRUST_REMOTE_CODE=1 \ -e OMP_NUM_THREADS=8 \ -e PYTORCH_TUNABLEOP_ENABLED=1 \ -e PYTORCH_TUNABLEOP_TUNING=0 \ -e PYTORCH_TUNABLEOP_RECORD_UNTUNED=0 \ -e VLLM_ROCM_USE_AITER=0 \ -e PYTORCH_TUNABLEOP_FILENAME=/tunableop/tunableop_merged.csv \ -e PYTORCH_TUNABLEOP_UNTUNED_FILENAME=/tunableop/tunableop_untuned%%d.csv \ -e GPU_MAX_HW_QUEUES=1 \ -p "$2":8000 \ -e TRITON_CACHE_DIR=/root/.triton/cache \ vllm-mxfp4-patched \ /app/models/Qwen3.5-397B-A17B-MXFP4 \ --served-model-name "$1" --host 0.0.0.0 --port 8000 --trust-remote-code \ --enable-prefix-caching --gpu-memory-utilization 0.98 --tensor-parallel-size 8 \ --max-model-len 131072 --max-num-seqs 4 \ --tool-call-parser qwen3_coder --enable-auto-tool-choice \ --override-generation-config '{"max_tokens": 64000, "temperature": 1.0, "top_p": 0.95, "top_k": 20, "presence_penalty": 1.5}' \ --compilation-config '{"cudagraph_capture_sizes": [1, 2, 4, 8, 16, 32, 64, 128], "max_cudagraph_capture_size": 128}' \ --max-num-batched-tokens 2048 \ --limit-mm-per-prompt.image 2 --mm-processor-cache-gb 1 \ --mm-processor-kwargs '{"max_pixels": 602112}' \ --reasoning-parser qwen3
Loading model 400-600s first time, and then got 30 t/s on tg, 3.5-3.7k on pp in one request.
in 4x requests you will got up to 100 t/s.
I limit power per gpu (210W), if power limit 300W per gpu will speedup model.
Best result with this model i have when thinking budget is 0 tokens for coding tasks.
submitted by