Distributional Process Reward Models: Calibrated Prediction of Future Rewards via Conditional Optimal Transport
arXiv:2605.06785v2
Abstract: Inference-time scaling methods rely on Process Reward Models (PRMs), which are often poorly calibrated and overestimate success probabilities. We propose, to our knowledge, the first use of condition…