The Alignment Tax: Response Homogenization in Aligned LLMs and Its Implications for Uncertainty Estimation
arXiv:2603.24124v2 Announce Type: replace-cross
Abstract: RLHF-aligned language models exhibit response homogenization: on TruthfulQA (n=790), 40-79% of questions produce a single semantic cluster across 10 i.i.d. samples. On affected questions, sampl…
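As a rough illustration of the measurement described in the abstract, the sketch below computes the fraction of questions whose sampled responses all fall into one semantic cluster, together with the entropy of the cluster distribution for a single question. It assumes each of the 10 samples has already been assigned an integer cluster label by some semantic-equivalence procedure that is not shown here. The function names `cluster_entropy` and `single_cluster_rate` and the toy data are hypothetical, not taken from the paper.

```python
import math
from collections import Counter


def cluster_entropy(cluster_ids):
    """Shannon entropy (in nats) of the empirical distribution over cluster labels."""
    n = len(cluster_ids)
    counts = Counter(cluster_ids)
    # p * log(1/p) is exactly 0.0 when every sample shares one cluster (p == 1).
    return sum((c / n) * math.log(n / c) for c in counts.values())


def single_cluster_rate(per_question_clusters):
    """Fraction of questions whose samples all share a single cluster label."""
    single = sum(1 for ids in per_question_clusters if len(set(ids)) == 1)
    return single / len(per_question_clusters)


# Toy data (assumed, not from the paper): 4 questions, 10 samples each.
questions = [
    [0] * 10,             # homogenized: one cluster
    [0] * 5 + [1] * 5,    # two equally likely clusters
    [0] * 10,             # homogenized: one cluster
    list(range(10)),      # every sample in its own cluster
]

rate = single_cluster_rate(questions)
entropies = [cluster_entropy(q) for q in questions]
```

When all samples for a question land in one cluster, the cluster entropy is zero by definition, so an entropy score computed this way cannot rank such questions against one another.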