/u/NoVibeCoding - Provide.ai

[D] 60% MatMul Performance Bug in cuBLAS on RTX 5090

/u/NoVibeCoding / April 10, 2026

cuBLAS dispatches an inefficient kernel for every batched FP32 workload, from 256×256 to 8192×8192×8. That kernel uses only ~40% of the available compute on RTX GPUs, leaving roughly 60% of the performance on the table. Tested on an RTX 5090, but all non-Pro RTX GPUs are likely affected. I tested with the latest C…
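For anyone who wants to sanity-check a "~40% of available compute" figure on their own card, here is a minimal sketch of the arithmetic. It is not the benchmark from the post. The function names are hypothetical, the timing has to come from your own measurement, and the peak FP32 throughput is a parameter you supply from your GPU's spec sheet. It assumes the usual 2·M·N·K FLOP count per GEMM.

```python
def batched_gemm_flops(m: int, n: int, k: int, batch: int = 1) -> int:
    """FLOPs for `batch` independent (m x k) @ (k x n) products.

    Each output element needs k multiplies and k adds, hence 2*m*n*k per GEMM.
    """
    return 2 * m * n * k * batch


def compute_utilization(flops: float, seconds: float, peak_flops_per_s: float) -> float:
    """Fraction of peak throughput achieved by a run that took `seconds`."""
    if seconds <= 0 or peak_flops_per_s <= 0:
        raise ValueError("seconds and peak_flops_per_s must be positive")
    return (flops / seconds) / peak_flops_per_s


if __name__ == "__main__":
    # Hypothetical inputs for illustration only: replace with your measured
    # kernel time and the peak FP32 rate from your GPU's datasheet.
    measured_seconds = 0.25
    assumed_peak = 100e12  # placeholder value, not a measured or quoted spec

    work = batched_gemm_flops(8192, 8192, 8192, batch=8)
    frac = compute_utilization(work, measured_seconds, assumed_peak)
    print(f"{work:.3e} FLOPs, {frac:.1%} of assumed peak")
```

Time only the GEMM call itself (after a warm-up run and a device sync), otherwise launch overhead and allocation will drag the number down for the small sizes.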

