Bringing Order to Asynchronous SGD: Towards Optimality under Data-Dependent Delays with Momentum
arXiv:2605.02043v1 Announce Type: new
Abstract: Asynchronous stochastic gradient descent (SGD) enables scalable distributed training but suffers from gradient staleness. Existing mitigation strategies, such as delay-adaptive learning rates and stalene…