cs.CL

Sparse Reward Subsystem in Large Language Models

arXiv:2602.00986v2 Announce Type: replace
Abstract: Recent studies show that LLM hidden states encode reward-related information, such as answer correctness and model confidence. However, existing approaches typically fit black-box probes on the full …