Principled Steering via Null-space Projection for Jailbreak Defense in Vision-Language Models
arXiv:2603.22094v2 Announce Type: replace
Abstract: As vision-language models (VLMs) are increasingly deployed in open-world scenarios, visual jailbreak attacks can easily induce them to generate harmful content, posing serious risks to model s…