Bit-Accurate Modeling of GPU Matrix Multiply-Accumulate Units: Demystifying Numerical Discrepancy and Accuracy
arXiv:2511.10909v2 Announce Type: replace-cross
Abstract: Modern AI accelerators rely on matrix multiply-accumulate units (MMAUs), such as NVIDIA Tensor Cores and AMD Matrix Cores, to accelerate deep neural network workloads. MMAUs expose only instruction-level or API-level interfaces for matrix multiply-accumulate (MMA) operations, leaving their internal floating-point arithmetic behaviors undocumented. Consequently, MMAUs across vendors and architectural generations often produce numerical discrepancies for identical inputs, and sometimes exhibit reduced numerical accuracy that can cause training instability. Diagnosing and understanding the root causes of these effects is challenging without white-box models of their arithmetic behaviors. This paper proposes closed-loop feature probing (CLFP), a generic and systematic framework for constructing complete arithmetic behavior models of MMA operations. Based on this framework, we analyze all MMA instructions on ten GPU architectures, spanning NVIDIA Volta to RTX Blackwell and AMD CDNA1 to CDNA3, and derive the first bit-accurate arithmetic models for these MMAUs. Our models explain previously observed cross-platform numerical discrepancies and accuracy issues, enable white-box numerical error analysis, reveal four precision bottleneck designs and one numerical asymmetry design that significantly affect numerical accuracy, and provide software workarounds as well as design guidance for future MMAUs. This work is open-source at https://github.com/microsoft/MMA-Sim.
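To give a concrete intuition for the kind of precision-bottleneck effect the abstract describes, the sketch below emulates a dot product whose running sum is rounded to a fixed significand width after each accumulation step, a common source of discrepancy between units with narrow and wide internal accumulators. This is an illustrative toy model, not the paper's bit-accurate MMAU model; the names `round_sig`, `dot`, and `acc_bits` are hypothetical.

```python
import math

def round_sig(x, bits):
    """Round x to a significand of `bits` bits (round-to-nearest,
    ties-to-even via Python's round()). Illustrative only."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)          # x = m * 2**e, with 0.5 <= |m| < 1
    scale = 2.0 ** bits
    return round(m * scale) / scale * 2.0 ** e

def dot(a, b, acc_bits):
    """Dot product where every partial sum is rounded to `acc_bits`
    significand bits, mimicking a narrow internal accumulator."""
    s = 0.0
    for x, y in zip(a, b):
        s = round_sig(s + x * y, acc_bits)
    return s

# Four small addends that individually round away against 1.0
# in an 11-bit (FP16-like) accumulator, but survive in a
# 24-bit (FP32-like) one.
a = [1.0, 2.0**-11, 2.0**-11, 2.0**-11, 2.0**-11]
b = [1.0] * 5

print(dot(a, b, 11))   # FP16-like accumulation: 1.0
print(dot(a, b, 24))   # FP32-like accumulation: 1.001953125
```

Identical inputs thus yield different results purely from the accumulator's internal width and rounding, which is exactly the class of undocumented behavior the paper's CLFP framework is designed to probe and model.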