PerMix-RLVR: Preserving Persona Expressivity under Verifiable-Reward Alignment
arXiv:2604.08986v1 Announce Type: new
Abstract: Persona prompting has been widely adopted to steer the behavior of large language models (LLMs) and improve their instruction performance by assigning them specific characters. However, identifying an optimal person…