Bat-Sheva Einbinder, Hen Davidov, Yee Whye Teh, Yarin Gal, Yaniv Romano

Selective Safety Steering via Value-Filtered Decoding

May 15, 2026

arXiv:2605.14746v1
Abstract: While large language models (LLMs) are trained to align with human values, their generations may still violate safety constraints. A growing line of work addresses this problem by modifying the model’s s…
