The two clocks and the innovation window: When and how generative models learn rules

arXiv:2605.10019v1

Abstract: Generative models trained on finite data face a fundamental tension: their score-matching or next-token objective converges to the empirical training distribution rather than the population distribution we seek to learn. Using rule-valid synthetic tasks, we trace this tension across two training timescales: $\tau_{\mathrm{rule}}$, the step at which generations first become rule-valid, and $\tau_{\mathrm{mem}}$, the step at which models begin reproducing training samples. Focusing on parity and extending to other binary rules and combinatorial puzzles, we characterize how these two clocks, $\tau_{\mathrm{rule}}$ and $\tau_{\mathrm{mem}}$, depend on key aspects of the learning setup. Specifically, we show that $\tau_{\mathrm{rule}}$ increases with rule complexity and decreases with model capacity, while $\tau_{\mathrm{mem}}$ is approximately invariant to the rule and scales nearly linearly with dataset size $N$. We define the \emph{innovation window} as the interval $[\tau_{\mathrm{rule}}, \tau_{\mathrm{mem}}]$. This window widens with increasing $N$ and narrows with rule complexity, and may vanish entirely when $\tau_{\mathrm{rule}} \geq \tau_{\mathrm{mem}}$. The same two-clock structure arises in both diffusion (DiT) and autoregressive (GPT) models, with architecture-dependent offsets. Dissecting the learned score of DiT models reveals a corresponding evolution of the optimization landscapes, where rule-valid samples' basins expand substantially around $\tau_{\mathrm{rule}}$, while training samples' basins begin to dominate around $\tau_{\mathrm{mem}}$. Together, these results yield a unified and predictive account of when and how generative models exhibit genuine innovation.
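The two clocks can be read off from standard training-time logs. A minimal sketch, under assumptions not taken from the paper: suppose that at each checkpoint we log the fraction of generations that satisfy the rule ("validity") and the fraction that exactly reproduce a training sample ("memorization"), and define each clock as the first step at which its metric crosses a chosen threshold (the threshold value and the exact memorization criterion are hypothetical illustration choices).

```python
# Hypothetical sketch of estimating tau_rule, tau_mem, and the innovation
# window from logged per-checkpoint metrics. The 0.9 threshold and the toy
# curves below are illustrative assumptions, not values from the paper.

def first_crossing(steps, values, threshold):
    """Return the first step at which `values` reaches `threshold`, else None."""
    for step, v in zip(steps, values):
        if v >= threshold:
            return step
    return None

def innovation_window(steps, validity, memorization, thr=0.9):
    """Return (tau_rule, tau_mem), or None if the window vanishes."""
    tau_rule = first_crossing(steps, validity, thr)       # generations become rule-valid
    tau_mem = first_crossing(steps, memorization, thr)    # training samples reproduced
    if tau_rule is None or tau_mem is None or tau_rule >= tau_mem:
        return None  # no window: clocks never trigger, or tau_rule >= tau_mem
    return (tau_rule, tau_mem)

# Toy logged curves: validity saturates early, memorization rises later.
steps = list(range(0, 1000, 100))
validity = [0.0, 0.2, 0.5, 0.95, 0.97, 0.98, 0.99, 0.99, 0.99, 0.99]
memorization = [0.0, 0.0, 0.0, 0.05, 0.10, 0.30, 0.60, 0.92, 0.95, 0.97]

print(innovation_window(steps, validity, memorization))  # → (300, 700)
```

Under these toy curves the window is $[300, 700]$; shrinking it (e.g. by reducing dataset size so memorization sets in earlier) eventually makes the function return None, mirroring the paper's vanishing-window regime.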
