Bench 2xMI50 Qwen3.5 27b vs Gemma4 31B (vllm-gfx906-mobydick)

Inference engine used (vllm fork): https://github.com/ai-infos/vllm-gfx906-mobydick/tree/main

Huggingface Quants used: QuantTrio/Qwen3.5-27B-AWQ vs cyankiwi/gemma-4-31B-it-AWQ-4bit

Relevant commands to run:

docker run -it --name vllm-gfx906-mobydick \
  -v ~/llm/models:/models \
  --network host \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video \
  --group-add $(getent group render | cut -d: -f3) \
  --ipc=host \
  aiinfos/vllm-gfx906-mobydick:latest

FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" OMP_NUM_THREADS=4 VLLM_LOGGING_LEVEL=DEBUG vllm serve \
  /models/gemma-4-31B-it-AWQ-4bit \
  --served-model-name gemma-4-31B-it-AWQ-4bit \
  --dtype float16 \
  --max-model-len auto \
  --gpu-memory-utilization 0.95 \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 \
  --reasoning-parser gemma4 \
  --mm-processor-cache-gb 1 \
  --limit-mm-per-prompt.image 1 \
  --limit-mm-per-prompt.video 1 \
  --limit-mm-per-prompt.audio=1 \
  --skip-mm-profiling \
  --tensor-parallel-size 2 \
  --async-scheduling \
  --host 0.0.0.0 \
  --port 8000 2>&1 | tee log.txt

FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" OMP_NUM_THREADS=4 VLLM_LOGGING_LEVEL=DEBUG vllm serve \
  /models/Qwen3.5-27B-AWQ \
  --served-model-name Qwen3.5-27B-AWQ \
  --dtype float16 \
  --enable-log-requests \
  --enable-log-outputs \
  --log-error-stack \
  --max-model-len auto \
  --gpu-memory-utilization 0.98 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":5}' \
  --mm-processor-cache-gb 1 \
  --limit-mm-per-prompt.image 1 \
  --limit-mm-per-prompt.video 1 \
  --skip-mm-profiling \
  --tensor-parallel-size 4 \
  --host 0.0.0.0 \
  --port 8000 2>&1 | tee log.txt

FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" OMP_NUM_THREADS=4 VLLM_LOGGING_LEVEL=DEBUG vllm bench serve \
  --dataset-name random \
  --random-input-len 5000 \
  --random-output-len 500 \
  --num-prompts 4 \
  --request-rate 10000 \
  --ignore-eos 2>&1 | tee logb.txt

RESULTS GEMMA 4 31B AWQ

============ Serving Benchmark Result ============
Successful requests:                     4
Failed requests:                         0
Request rate configured (RPS):           10000.00
Benchmark duration (s):                  106.54
Total input tokens:                      20000
Total generated tokens:                  2000
Request throughput (req/s):              0.04
Output token throughput (tok/s):         18.77
Peak output token throughput (tok/s):    52.00
Peak concurrent requests:                4.00
Total token throughput (tok/s):          206.49
---------------Time to First Token----------------
Mean TTFT (ms):                          42848.83
Median TTFT (ms):                        43099.40
P99 TTFT (ms):                           65550.49
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          127.20
Median TPOT (ms):                        126.72
P99 TPOT (ms):                           173.17
---------------Inter-token Latency----------------
Mean ITL (ms):                           127.20
Median ITL (ms):                         81.59
P99 ITL (ms):                            85.56
==================================================
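As a quick cross-check of the Gemma numbers above, the throughput figures follow directly from the token counts and the benchmark duration (a sketch; values copied from the report):

```python
# Sanity-check the reported Gemma 4 31B throughput figures.
duration_s = 106.54
input_tokens = 20000
output_tokens = 2000

output_tps = output_tokens / duration_s
total_tps = (input_tokens + output_tokens) / duration_s

print(f"output tok/s: {output_tps:.2f}")  # ~18.77, matches the report
print(f"total tok/s:  {total_tps:.2f}")   # ~206.5, matches the reported 206.49
```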

RESULTS QWEN3.5 27B AWQ

============ Serving Benchmark Result ============
Successful requests:                     4
Failed requests:                         0
Request rate configured (RPS):           10000.00
Benchmark duration (s):                  51.18
Total input tokens:                      20000
Total generated tokens:                  2000
Request throughput (req/s):              0.08
Output token throughput (tok/s):         39.08
Peak output token throughput (tok/s):    28.00
Peak concurrent requests:                4.00
Total token throughput (tok/s):          429.89
---------------Time to First Token----------------
Mean TTFT (ms):                          24768.32
Median TTFT (ms):                        25428.47
P99 TTFT (ms):                           35226.79
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          49.20
Median TPOT (ms):                        46.08
P99 TPOT (ms):                           72.41
---------------Inter-token Latency----------------
Mean ITL (ms):                           269.04
Median ITL (ms):                         154.46
P99 ITL (ms):                            2969.67
---------------Speculative Decoding---------------
Acceptance rate (%):                     89.70
Acceptance length:                       5.48
Drafts:                                  365
Draft tokens:                            1825
Accepted tokens:                         1637
Per-position acceptance (%):
  Position 0: 91.23
  Position 1: 90.14
  Position 2: 89.86
  Position 3: 89.04
  Position 4: 88.22
==================================================
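The speculative-decoding summary stats above can be rederived from the raw draft/accept counts (a sketch; my reading is that "acceptance length" counts the target model's own token plus the accepted draft tokens per pass):

```python
# Rederive the speculative-decoding summary from the raw counts.
drafts = 365
draft_tokens = 1825     # 365 drafts * 5 speculative tokens each (MTP-5)
accepted_tokens = 1637

acceptance_rate = 100 * accepted_tokens / draft_tokens
acceptance_length = 1 + accepted_tokens / drafts  # +1 for the target's own token

print(f"acceptance rate:   {acceptance_rate:.2f} %")  # ~89.70, matches the report
print(f"acceptance length: {acceptance_length:.2f}")  # ~5.48, matches the report
```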

FINAL NOTES:

As expected, Qwen3.5 is faster thanks to MTP-5 and its architecture and size (note that I also used an AWQ quant with group size 128 for it vs. 32 for Gemma4). But it generates far more thinking tokens than Gemma4, so overall it can end up slower.
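To put that caveat in numbers: from the two runs above, Qwen3.5 decodes roughly twice as fast, but a larger thinking-token budget can erase that lead. The multiplier below is hypothetical, not measured, just to illustrate the break-even point:

```python
# Observed decode-speed ratio from the two benchmark runs above.
qwen_output_tps = 39.08
gemma_output_tps = 18.77
raw_speedup = qwen_output_tps / gemma_output_tps
print(f"raw decode speedup: {raw_speedup:.2f}x")  # ~2.08x

# If Qwen emits N times more tokens (thinking included) to finish the same
# task, its wall-clock advantage shrinks by that factor. N = 2.5 is a
# hypothetical illustration, not a measured value:
thinking_multiplier = 2.5
effective_speedup = raw_speedup / thinking_multiplier
print(f"effective speedup at 2.5x tokens: {effective_speedup:.2f}x")  # < 1x
```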

In my agentic use cases, Qwen3.5 also remains slightly better than Gemma4.

submitted by /u/ai-infos