A Self-Attentive Meta-Optimizer with Group-Adaptive Learning Rates and Weight Decay
arXiv:2605.04055v1 Announce Type: new
Abstract: Adaptive optimizers like AdamW apply uniform hyperparameters across all parameter groups, ignoring heterogeneous optimization dynamics across layers and modules. We address this limitation by proposing M…
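To make the limitation concrete, here is a minimal PyTorch sketch of the baseline behavior the abstract critiques: `torch.optim.AdamW` does accept per-group learning rates and weight decay, but only as static, hand-chosen values fixed before training. The model, the grouping by module, and the specific hyperparameter values below are illustrative assumptions, not the paper's method; the proposed meta-optimizer would instead adapt such per-group values during training.

```python
import torch

# Toy model whose layers we will treat as separate parameter groups.
model = torch.nn.Sequential(
    torch.nn.Linear(128, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 10),
)

# Hypothetical grouping: each module gets its own lr / weight_decay.
# These values are illustrative; in standard AdamW they stay fixed,
# which is the uniformity the abstract argues against.
groups = [
    {"params": model[0].parameters(), "lr": 1e-3, "weight_decay": 1e-2},
    {"params": model[2].parameters(), "lr": 3e-4, "weight_decay": 0.0},
]
optimizer = torch.optim.AdamW(groups)

# Each entry in optimizer.param_groups carries its own hyperparameters;
# a group-adaptive meta-optimizer would update these entries per group
# over the course of training instead of leaving them constant.
for group in optimizer.param_groups:
    print(group["lr"], group["weight_decay"])
```

Note that mutating `optimizer.param_groups[i]["lr"]` between steps is the standard PyTorch hook a learning-rate scheduler uses, so a per-group adaptive scheme can be layered on the same mechanism.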