Calibration-Aware Policy Optimization for Reasoning LLMs
arXiv:2604.12632v1 Announce Type: cross
Abstract: Group Relative Policy Optimization (GRPO) enhances LLM reasoning but often induces overconfidence, where incorrect responses yield lower perplexity than correct ones, degrading relative calibration as …
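
To make the overconfidence symptom described in the abstract concrete, here is a minimal sketch of how one might check it. It is not the authors' method or metric. It assumes that per-token log-probabilities are available for each sampled response, that perplexity is used as an inverse confidence score, and that relative calibration is summarized as the fraction of (correct, incorrect) response pairs in which the correct response has the lower perplexity (an AUC-style ranking score). The function names `perplexity` and `relative_calibration_auc` are hypothetical.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of one response: exp of the negative mean token log-probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def relative_calibration_auc(perplexities, correct):
    """Hypothetical ranking score (an assumption, not taken from the paper).

    Returns the fraction of (correct, incorrect) pairs where the correct
    response has the lower perplexity; ties count as half a win.
    A value near 1.0 means confidence ranks correct answers first, 0.5 is
    chance level, and below 0.5 matches the overconfidence pattern in which
    incorrect responses tend to have lower perplexity than correct ones.
    """
    pos = [p for p, c in zip(perplexities, correct) if c]
    neg = [p for p, c in zip(perplexities, correct) if not c]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect response")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p < n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data, invented purely for illustration.
toy_perplexities = [1.2, 3.0, 1.1, 2.0]
toy_correct = [True, True, False, False]
toy_score = relative_calibration_auc(toy_perplexities, toy_correct)
```

On the toy data only one of the four (correct, incorrect) pairs is ranked properly, so the score falls below chance level, which is the pattern the abstract calls overconfidence.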