Exploring Silent Data Corruption as a Reliability Challenge in LLM Training
arXiv:2604.00726v1 Announce Type: new
Abstract: As Large Language Models (LLMs) scale in size and complexity, the consequences of failures during training become increasingly severe. A major challenge arises from Silent Data Corruption (SDC): hardware…