Tokens-per-Parameter Coverage Is Critical for Robust LLM Scaling Law Extrapolation
arXiv:2605.08541v1 Announce Type: new
Abstract: Neural scaling laws approximate a language model’s loss as a power-law function of parameter count $N$ and token count $D$. Following Chinchilla-style compute-optimal training, many studies fit scaling l…
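To make the abstract's setup concrete, below is a minimal sketch (not the paper's code) of fitting the standard Chinchilla-style parametric loss surface $L(N, D) = E + A/N^{\alpha} + B/D^{\beta}$ to a handful of training runs spanning different tokens-per-parameter ratios $D/N$. The synthetic data, parameter values, and use of `scipy.optimize.curve_fit` are illustrative assumptions; the paper's actual fitting procedure and datasets are not reproduced here.

```python
# Minimal sketch: fit L(N, D) = E + A / N^alpha + B / D^beta to synthetic runs.
# All constants below are illustrative assumptions, not values from the paper.
import numpy as np
from scipy.optimize import curve_fit

def chinchilla_loss(ND, E, A, alpha, B, beta):
    """Parametric loss: irreducible term plus power laws in N (params) and D (tokens)."""
    N, D = ND
    return E + A / N**alpha + B / D**beta

# Synthetic training runs covering a range of tokens-per-parameter ratios D/N.
rng = np.random.default_rng(0)
N = np.logspace(7, 9, 12)                                  # 10M .. 1B parameters
ratios = rng.uniform(10.0, 40.0, N.size)                   # D/N roughly around 20
D = N * ratios
true = dict(E=1.69, A=406.4, alpha=0.34, B=410.7, beta=0.28)   # illustrative only
loss = chinchilla_loss((N, D), **true) + rng.normal(0, 0.005, N.size)

# Fit the five parameters; bounds keep scales and exponents positive.
p0 = [1.5, 100.0, 0.3, 100.0, 0.3]
popt, _ = curve_fit(chinchilla_loss, (N, D), loss, p0=p0,
                    bounds=([0, 0, 0, 0, 0], [10, 1e4, 1, 1e4, 1]),
                    maxfev=20000)
E_hat, A_hat, alpha_hat, B_hat, beta_hat = popt
print(f"fitted: E={E_hat:.3f} A={A_hat:.1f} alpha={alpha_hat:.3f} "
      f"B={B_hat:.1f} beta={beta_hat:.3f}")
```

The fitted exponents are only trustworthy for extrapolation to the extent that the runs cover the tokens-per-parameter regime of interest, which is the coverage issue the title refers to.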