Kazutoshi Shinoda, Kosuke Nishida, Kyosuke Nishida

Debiasing Reward Models via Causally Motivated Inference-Time Intervention

Kazutoshi Shinoda, Kosuke Nishida, Kyosuke Nishida / May 1, 2026

arXiv:2604.27495v1 Announce Type: cross
Abstract: Reward models (RMs) play a central role in aligning large language models (LLMs) with human preferences. However, RMs are often sensitive to spurious features such as response length. Existing inferenc…
