How Transformers Learn to Plan via Multi-Token Prediction
arXiv:2604.11912v1 Announce Type: cross
Abstract: While next-token prediction (NTP) has been the standard objective for training language models, it often struggles to capture global structure in reasoning tasks. Multi-token prediction (MTP) has recen…