Selective Safety Steering via Value-Filtered Decoding
arXiv:2605.14746v1 Announce Type: new
Abstract: While large language models (LLMs) are trained to align with human values, their generations may still violate safety constraints. A growing line of work addresses this problem by modifying the model’s s…