Arithmetic OOD Failure Unfolds in Stages in Minimal GPTs
arXiv:2603.26828v1 Announce Type: cross
Abstract: Arithmetic benchmarks are often reduced to a single held-out score, but that score can conflate qualitatively different failures. We study a controlled minimal GPT trained on exhaustive 2-digit additio…