Immediate Derivatives Suffice for Online Recurrent Adaptation
arXiv:2603.28750v3 Announce Type: replace
Abstract: For three decades, online recurrent learning has been assumed to require propagating a Jacobian tensor through the network's dynamics at $O(n^4)$ per step. We show it doesn't. Dropping the propagation entirely ($d=0$, $O(n^2)$ memory) matches full RTRL within confidence intervals on held-out brain-computer-interface (BCI) cross-session drift (TOST-equivalent within $\pm 3$ pp at $n=20$, Adam, float64), and across vanilla-RNN synthetic cells (sine and Lorenz under Adam and SGD) and LSTM/sine under Adam.
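For orientation, here is a minimal NumPy sketch of the two per-step updates being compared above, assuming the standard RTRL recursion for a vanilla tanh RNN and gradients with respect to the recurrent weights $W_{hh}$ only; the function names are illustrative, not the paper's code.

```python
# Minimal sketch (illustrative, not the paper's code): vanilla tanh RNN,
# gradient of a per-step loss w.r.t. the recurrent weights W_hh only.
import numpy as np

def rtrl_step(W_hh, h_prev, x_proj, S_prev, dL_dh):
    """Full RTRL: propagate the influence tensor S[i,j,k] = dh_t[i]/dW_hh[j,k].
    Storing S costs O(n^3); the first einsum below is the O(n^4)-per-step
    propagation referred to in the abstract."""
    a = W_hh @ h_prev + x_proj                    # pre-activation (input term precomputed)
    h = np.tanh(a)
    d = 1.0 - h ** 2                              # tanh'(a)
    n = h.shape[0]
    imm = np.zeros((n, n, n))
    imm[np.arange(n), np.arange(n), :] = h_prev   # immediate partial: da_i/dW_hh[j,k] = delta_ij * h_prev[k]
    S = d[:, None, None] * (np.einsum('im,mjk->ijk', W_hh, S_prev) + imm)
    grad = np.einsum('i,ijk->jk', dL_dh, S)       # g_rtrl = g_imm + g_past
    return h, S, grad

def d0_step(W_hh, h_prev, x_proj, dL_dh):
    """d = 0 truncation: treat S_prev as zero and keep only the immediate
    derivative, so memory stays O(n^2) (just weights and activations)."""
    a = W_hh @ h_prev + x_proj
    h = np.tanh(a)
    d = 1.0 - h ** 2
    grad = np.outer(dL_dh * d, h_prev)            # g_imm only
    return h, grad
```

Either gradient can then be handed to the same first-order optimizer (Adam, SGD, etc.); the equivalence tests above swap only the gradient while holding the optimizer fixed.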
A decomposition $g_{RTRL} = g_{imm} + g_{past}$ (written out below) explains why. On BCI, $g_{past}$ concentrates in a single direction (top-1 singular fraction 0.62–0.74 across four optimizers, vs 0.333 for $g_{imm}$), and across the four optimizers the full-RTRL-vs-$d=0$ recovery gap tracks each optimizer's per-layer update-magnitude ratio $\|\Delta W_{hh}\|/\|\Delta W_{out}\|$ monotonically. A stationary (no-drift) control collapses both concentrations to $\approx 0.6$: the drift-specific signal is the differential, not $g_{past}$'s absolute rank-1 structure. Both the signature and the behavioral gap collapse on LSTM, consistent with a mechanism specific to additive linear recurrence. On synthetic sine, $g_{imm}$ is redundant with $g_{past}$, which predicts the synthetic null. Full RTRL's one robust advantage is LARS (+17 to +27 pp), but $d=0$+LARS also fails to adapt on its own; the gap is an optimizer$\times$method interaction, not a method-quality claim. We characterize the regime: $d=0$+Adam+float64 is robust, while SGD, Adafactor, and float32 have specific fragilities documented in the paper. On the evaluated cells, the $1000\times$ memory saving at $n=1024$ ($O(n^2)$ vs $O(n^3)$ memory) comes with no measured recovery cost.
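For reference, one standard way to write the split, in our notation (the paper may parameterize it differently): for a vanilla recurrence $h_t = \phi(W_{hh} h_{t-1} + W_{xh} x_t)$ with per-step loss $L_t$,
$$
g_{RTRL} \;=\; \frac{\partial L_t}{\partial h_t}\,\frac{d h_t}{d W_{hh}}
\;=\; \underbrace{\frac{\partial L_t}{\partial h_t}\,\frac{\partial h_t}{\partial W_{hh}}}_{g_{imm}}
\;+\; \underbrace{\frac{\partial L_t}{\partial h_t}\,\frac{\partial h_t}{\partial h_{t-1}}\,\frac{d h_{t-1}}{d W_{hh}}}_{g_{past}},
$$
and $d=0$ sets $\frac{d h_{t-1}}{d W_{hh}} = 0$, keeping only $g_{imm}$.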