I posted earlier about RTX 5060 Ti local LLM testing, and I have cleaned the repo up quite a bit since then.
The project is now a more structured benchmark/recipe repo rather than scattered notes. It has a static results explorer, schema-validated benchmark JSON, clearer llama.cpp/vLLM notes, single-card and dual-card RTX 5060 Ti recipes, a model-agnostic download helper, and better labels for generation speed, prompt eval speed, MTP/no-MTP, and thinking mode.
Repo: https://github.com/5p00kyy/club-5060ti
Results explorer: https://5p00kyy.github.io/club-5060ti/
The tested baseline is still the RTX 5060 Ti 16GB, mostly a 2x 5060 Ti setup for the larger Qwen3.6 runs. I do not want to imply the numbers are universal. The useful part is the recipe shape and the reporting discipline: exact hardware, runtime, model, quant, context, KV cache, generated tokens, prompt eval speed, generation speed, and caveats.
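As a rough illustration of that reporting discipline, here is a minimal Python sketch of a result record carrying those fields, plus a check for missing ones. The field names and the `BenchmarkResult`/`missing_fields` helpers are hypothetical and are not the actual club-5060ti JSON schema.

```python
from dataclasses import dataclass, field, fields


@dataclass
class BenchmarkResult:
    """Hypothetical result record; field names are illustrative, not the repo's real schema."""
    hardware: str             # exact GPU count/model, CPU, motherboard
    runtime: str              # runtime and version, e.g. a llama.cpp build or vLLM release
    model: str
    quant: str
    context_length: int
    kv_cache: str             # KV cache type/precision
    generated_tokens: int
    prompt_eval_tok_s: float  # prompt eval speed, tokens/s
    generation_tok_s: float   # generation speed, tokens/s
    caveats: list[str] = field(default_factory=list)  # optional free-form notes


# Every field except caveats is treated as required in this sketch.
REQUIRED = [f.name for f in fields(BenchmarkResult) if f.name != "caveats"]


def missing_fields(raw: dict) -> list[str]:
    """Return required fields that are absent or empty in a submitted dict."""
    return [name for name in REQUIRED if raw.get(name) in (None, "")]
```

Running `missing_fields` on a partial submission lists the "boring details" that still need to be filled in before the result is comparable.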
One thing that came up in the comments was using different GPU architectures together. My current read is that llama.cpp/GGUF is the best first thing to test on non-5060 Ti or mixed-GPU setups. vLLM NVFP4/MTP is more Blackwell-specific and should not be assumed to work unchanged on other architectures.
Mixed-card and non-5060 Ti results are welcome, but they should be reported as their own hardware lane rather than blended into the 2x 5060 Ti baseline.
What would be useful from other people:
• dual 5060 Ti results from different CPUs/motherboards
• mixed-GPU and non-5060 Ti llama.cpp results
• vLLM version drift reports
• clear failure reports, not only successful runs
Some older llm-bench rows have been imported as archived historical data so they are not lost, but I am treating club-5060ti as the new source of truth. The plan is to rerun useful cases under the new benchmark protocol rather than relying on old mixed-method results.
https://github.com/5p00kyy/llm-bench is effectively being folded into this project as the results/data side of club-5060ti, instead of staying as a separate older benchmark repo.
If you test something, please include the boring details. Those are what make the results useful.
Edit:
Small update after some feedback: I’ve adjusted the repo framing so it is less tied to my exact 2x RTX 5060 Ti setup.
The project is meant to be a broader RTX 5060 Ti local inference hub, split into clear hardware lanes:
• 1x RTX 5060 Ti
• 2x RTX 5060 Ti
• 3x/4x+ RTX 5060 Ti
• mixed RTX 5060 Ti + other CUDA GPUs
• other CUDA GPU comparison/adaptation results
That should make quad-card setups, single-card setups, and mixed systems useful without pretending they are directly comparable to each other. The repo now has a hardware-lanes doc and the result submission templates ask people to label the lane and include topology/runtime/model/benchmark details.
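To make the lane split concrete, here is a minimal Python sketch that assigns a setup to one of the lanes above from its list of GPU names. The `hardware_lane` helper and its simple name matching are hypothetical illustrations. They are not part of the repo's submission templates.

```python
def hardware_lane(gpus: list[str]) -> str:
    """Map GPU model names to a hardware lane (hypothetical helper, naive name matching)."""
    if not gpus:
        raise ValueError("at least one GPU is required")
    # Normalize so "RTX 5060 Ti" and "RTX 5060Ti" both match; plain "RTX 5060" does not.
    n_5060ti = sum(1 for g in gpus if "5060ti" in g.lower().replace(" ", ""))
    n_other = len(gpus) - n_5060ti
    if n_5060ti == 0:
        return "other CUDA GPU"
    if n_other > 0:
        return "mixed RTX 5060 Ti + other CUDA GPUs"
    if n_5060ti == 1:
        return "1x RTX 5060 Ti"
    if n_5060ti == 2:
        return "2x RTX 5060 Ti"
    return "3x/4x+ RTX 5060 Ti"
```

Any non-5060 Ti card in the list moves the result into the mixed lane. This keeps those runs separate from the pure 5060 Ti baselines instead of blending them in.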