cs.LG, stat.ML

What should post-training optimize? A test-time scaling law perspective

arXiv:2605.10716v1 Announce Type: cross
Abstract: Large language models are increasingly deployed with test-time strategies: sample $N$ responses, score them with a reward model or verifier, and return the best. This deployment rule exposes a mismatch…
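
The deployment rule described in the abstract (sample $N$ responses, score them with a reward model or verifier, return the best) can be illustrated with a minimal Python sketch. The names `best_of_n`, `sample`, and `score` are hypothetical and do not come from the paper; the sampler and scorer are passed in as plain callables, standing in for a language model and a reward model or verifier.

```python
from typing import Callable, List, Tuple


def best_of_n(
    prompt: str,
    sample: Callable[[str], str],        # hypothetical: draws one response for the prompt
    score: Callable[[str, str], float],  # hypothetical: reward model or verifier score
    n: int,
) -> Tuple[str, float]:
    """Draw n responses, score each, and return the top-scoring (response, score)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Step 1: sample N candidate responses.
    candidates: List[str] = [sample(prompt) for _ in range(n)]
    # Step 2: score every candidate.
    scored = [(score(prompt, c), c) for c in candidates]
    # Step 3: return the best; on ties, max() keeps the earliest candidate.
    best_score, best_response = max(scored, key=lambda pair: pair[0])
    return best_response, best_score


if __name__ == "__main__":
    # Toy stand-ins (assumptions for illustration only): a fixed pool of
    # responses and a scorer that simply prefers longer strings.
    pool = iter(["short", "a longer answer", "mid one"])
    response, value = best_of_n(
        "question",
        sample=lambda p: next(pool),
        score=lambda p, r: float(len(r)),
        n=3,
    )
    print(response, value)
```

In this sketch the policy only supplies samples, while the returned output depends on the maximum score over the $N$ draws rather than on any single draw.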