Refresh-Scaling the Memory of Balanced Adam
arXiv:2605.10119v2 Announce Type: replace
Abstract: Recent evidence suggests that Adam performs robustly when its momentum parameters are tied, $\beta_1=\beta_2$, reducing the optimizer to a single remaining parameter. However, how this parameter should be set remains poorly understood. We argue that, in balanced Adam, $\beta$ should not be treated as a dimensionless constant: it defines a statistical memory horizon $H_\beta=(1-\beta)^{-1}$. Given the effective learning horizon $T_{\mathrm{ES}}$, estimated from the validation trajectory, we study the refresh count $R_\beta=(1-\beta)T_{\mathrm{ES}}$, which measures how many times Adam renews its internal statistics during the useful phase of training. Across 11 vision and language experiments, we find that choosing $\beta$ so that $R_\beta\approx1000$ selects a different $\beta$ value at each training scale, yet improves robustness over the best fixed-$\beta$ baseline. Compared with the strongest fixed choice, $\beta=0.944$, the refresh rule improves worst-case robustness, reducing the maximum relative gap in validation loss by 33.4\%, while bringing all 11 runs within 1\% of their validation oracle. These results suggest that the remaining hyperparameter of balanced Adam is more naturally viewed as a memory-scale variable than as a fixed constant. This provides a simple budget-aware perspective on optimizer scaling and opens a path toward treating Adam's momentum as part of the learning dynamics rather than as a static default.
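The refresh rule in the abstract inverts directly: fixing $R_\beta\approx1000$ in $R_\beta=(1-\beta)T_{\mathrm{ES}}$ gives $\beta=1-1000/T_{\mathrm{ES}}$. A minimal sketch of this inversion, assuming the effective horizon $T_{\mathrm{ES}}$ (in optimizer steps) has already been estimated from the validation trajectory; the helper names are illustrative, not from the paper:

```python
def beta_from_refresh_count(t_es: int, refresh_count: float = 1000.0) -> float:
    """Solve R_beta = (1 - beta) * T_ES for beta, given a target refresh count.

    t_es is the effective learning horizon in steps (assumed estimated
    elsewhere); the default target R_beta ~ 1000 follows the abstract.
    The result is clipped to [0, 1) so it remains a valid momentum value.
    """
    if t_es <= 0:
        raise ValueError("effective horizon must be positive")
    beta = 1.0 - refresh_count / t_es
    return max(0.0, min(beta, 1.0 - 1e-12))


def memory_horizon(beta: float) -> float:
    """Statistical memory horizon H_beta = (1 - beta)^{-1} in steps."""
    return 1.0 / (1.0 - beta)
```

For example, a run with an effective horizon of 100,000 steps would target $\beta=0.99$, i.e. a memory horizon of 100 steps renewed roughly 1000 times over the useful phase of training, while a 10x longer run would target $\beta=0.9999$; the rule adapts $\beta$ to scale instead of fixing it.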