cs.CL

Unleashing Implicit Rewards: Prefix-Value Learning for Distribution-Level Optimization

arXiv:2604.13197v1 Announce Type: new
Abstract: Process reward models (PRMs) provide fine-grained reward signals along the reasoning process, but training reliable PRMs often requires step annotations or heavy verification pipelines, making them expen…