[D] 60% MatMul Performance Bug in cuBLAS on RTX 5090

cuBLAS dispatches an inefficient kernel for every batched FP32 workload, from 256×256 to 8192×8192×8. The kernel it picks uses only ~40% of the available compute on RTX GPUs. I tested on an RTX 5090, but all non-Pro RTX GPUs are likely affected. I tested with the latest C…
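For anyone who wants to sanity-check a utilization figure like the ~40% above, here is a rough sketch of the arithmetic. It is not the benchmark from this post. The function names are made up for illustration, the reading of "8192×8192×8" as square M=N=K=8192 matrices with a batch of 8 is my assumption, and the elapsed time and peak TFLOPS in the example are placeholders rather than measurements or official specs.

```python
def batched_gemm_flops(m: int, n: int, k: int, batch: int = 1) -> int:
    """FLOPs for `batch` independent (m x k) @ (k x n) products.

    Each output element needs k multiply-adds, counted as 2 FLOPs each.
    """
    return 2 * m * n * k * batch


def utilization(flops: int, seconds: float, peak_tflops: float) -> float:
    """Fraction of a given peak throughput reached by one timed run."""
    if seconds <= 0 or peak_tflops <= 0:
        raise ValueError("seconds and peak_tflops must be positive")
    achieved_tflops = flops / seconds / 1e12
    return achieved_tflops / peak_tflops


if __name__ == "__main__":
    # Assumed shape: square 8192 matrices, batch of 8.
    flops = batched_gemm_flops(8192, 8192, 8192, batch=8)

    # Placeholders only: substitute your own measured kernel time and the
    # FP32 peak you consider correct for your card.
    elapsed_s = 0.22
    peak_tflops = 100.0

    print(f"total work: {flops / 1e12:.3f} TFLOP")
    print(f"utilization: {utilization(flops, elapsed_s, peak_tflops):.1%}")
```

The elapsed time should come from GPU-side timing (CUDA events or a profiler) after a few warm-up runs, since host wall-clock timing of asynchronous launches will skew the result.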