I have a 4x R9700 system on a Threadripper Pro, but I have never been happy with the performance of my GPUs in vLLM. I have started benchmarking every new model I try with llama-benchy so I can get a better idea of how models of different sizes and architectures compare on my system. With every model I have tested, I run into a wall around 64k tokens of context: TTFT, TG, and PP all fall on their faces at long context lengths.
So this past weekend I rented an MI300X from RunPod, thinking AMD must have this issue sorted on CDNA. When loading up vLLM with Qwen3.6-27B-FP8, I noticed that vLLM was selecting ROCm Attention instead of one of the AITER attention backends, which I thought was strange, but I pushed on with my benchmarking runs. After a run of llama-benchy I saw that the MI300X had the same issue my R9700s do at long context lengths: at >64k context, TG/s fell to single digits. This prompted me to go searching for an AMD runbook on running vLLM on the MI300X, where I found that the AITER attention backends are gated behind an env var that you have to explicitly enable. With this newfound information, I went back to trying to patch vLLM and AITER support for gfx1201.
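For anyone else who hits the same silent fallback, here's roughly how I enable it now. A minimal sketch: `VLLM_ROCM_USE_AITER` is the main gate I found, the exact set of AITER env vars varies between vLLM versions, and the model repo id here is just a placeholder for my test model:

```python
import os

# Must be set before vLLM is imported; on ROCm builds the AITER kernels are
# gated behind this env var, and without it vLLM quietly falls back to the
# default ROCm attention backend.
os.environ["VLLM_ROCM_USE_AITER"] = "1"

from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3.6-27B-FP8",  # placeholder HF repo id for my test model
    max_model_len=131072,          # long-context runs are where this matters
)

out = llm.generate(["Hello"], SamplingParams(max_tokens=128))
print(out[0].outputs[0].text)
```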
I already maintain a patched version of vLLM that I built to bring FP8 support to the R9700, built on top of the AITER Triton kernels. I had some issues when I was first patching in AITER support, so I disabled everything but the Triton kernels in order to get FP8 working. Most of the patching for AITER and vLLM just involves removing gates that block gfx1201, or adding that architecture wherever you see the MI350X (my understanding is that the MI350X and RDNA4 implement FP8 the same or in a very similar way, to the point that you can use some of the MI350X kernels on RDNA4). All of my testing was done with Qwen3.6 27B, since this model finally gives us close-to-SOTA performance at home. Because Qwen3.6 is a hybrid architecture, it kept crashing AITER Unified Attention due to a mismatch in the expected TILE_SIZE; AITER only supports KV block sizes that are a power of two.
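To give a flavor of what the patches look like, here's an illustrative sketch only; the real checks live in vLLM's ROCm platform code and AITER's dispatch logic under different names, so the allowlist variable and helper below are hypothetical:

```python
# Illustrative sketch of the typical gate I keep patching out.
# gfx942 = MI300X, gfx950 = MI350X, gfx1201 = R9700 (RDNA4).
import torch

SUPPORTED_FP8_ARCHS = {"gfx942", "gfx950"}  # hypothetical allowlist

def is_fp8_supported() -> bool:
    # gcnArchName looks like "gfx942:sramecc+:xnack-" on ROCm PyTorch,
    # so strip the feature flags before comparing.
    arch = torch.cuda.get_device_properties(0).gcnArchName.split(":")[0]
    return arch in SUPPORTED_FP8_ARCHS

# The patch is usually just this one line: treat RDNA4 like the MI350X,
# since both implement FP8 similarly enough for the kernels to work.
SUPPORTED_FP8_ARCHS.add("gfx1201")
```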
The main downside I have found so far, if you can call it that, is that you can only run the KV cache in FP16/BF16. Not that you would need to quantize the cache with the Qwen3.6 family, since its cache footprint is already tiny, but it's something to be aware of if you decide to try it out.
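Concretely, that just means leaving vLLM's `kv_cache_dtype` at its default rather than requesting FP8. A minimal sketch, with the same placeholder repo id as above:

```python
from vllm import LLM

# With the AITER Triton path on gfx1201, kv_cache_dtype="fp8" is off the
# table; "auto" keeps the cache in the model's dtype (FP16/BF16).
llm = LLM(
    model="Qwen/Qwen3.6-27B-FP8",  # placeholder repo id
    kv_cache_dtype="auto",
)
```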
I have attached some of my benchmark runs of Qwen3.6 on my R9700s and the MI300X I rented. I have not been able to rent another MI300X from RunPod to test with AITER attention, since there has been no availability the past few days. I'm sorry there is no pre-AITER benchmark; I seem to have overwritten it while troubleshooting. I do have my original benchmarks from Qwen3.6 35B, which I will attach. I have also attached a benchmark with MTP enabled and set to 3 tokens; as far as I can tell, at single concurrency it is free performance. At concurrency 2, TG performance drops off sharply at high context depths. The llama-benchy runs are TG128 and PP2048 at each of the context depths.
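For reference, MTP in vLLM is enabled through the speculative config. This is roughly what my MTP run used, as a sketch: the `method` string is an assumption, since vLLM registers model-specific MTP methods (e.g. Qwen3-Next uses "qwen3_next_mtp") and the right value depends on your build and model:

```python
from vllm import LLM, SamplingParams

# Sketch of enabling MTP with 3 speculative tokens on a recent vLLM build.
llm = LLM(
    model="Qwen/Qwen3.6-27B-FP8",  # placeholder repo id
    speculative_config={
        "method": "mtp",              # assumption; the name is model-dependent
        "num_speculative_tokens": 3,  # the "3 tokens" from my benchmark run
    },
)

print(llm.generate(["Hi"], SamplingParams(max_tokens=64))[0].outputs[0].text)
```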