RedDiffuser: Auditing Multimodal Safety Failures in Vision-Language Models via Reinforced Diffusion
arXiv:2503.06223v5 Announce Type: replace
Abstract: Large Vision-Language Models (VLMs) are increasingly deployed in open-ended environments, where ensuring reliable safety under multimodal inputs is critical. However, existing evaluations remain largely instruction-centric, focusing on explicit malicious queries while overlooking a more realistic and underexplored risk: whether safety alignment remains robust under harmful contextual exposure. This limitation is particularly important for multimodal systems, where visual inputs can substantially steer model behavior and render text-only auditing insufficient. In this work, we study multimodal safety auditing under harmful contextual exposure, asking whether VLMs can maintain safe behavior when partial toxic text is paired with visual context. To enable systematic auditing, we propose RedDiffuser (RedDiff), a reinforcement learning-based framework that leverages diffusion models to generate semantically coherent visual inputs for black-box safety testing. By combining greedy prompt search with reinforcement learning, RedDiffuser uncovers high-risk multimodal inputs that expose latent safety failures. Extensive experiments on both open-source and commercial VLMs show that such context-conditioned failures are widespread. On LLaVA, RedDiffuser increases unsafe response rates by up to 10.69% on the original set and 8.91% on a hold-out set, with strong transferability to Gemini and LLaMA-Vision. These vulnerabilities persist even under external safety guardrails, suggesting that current system-level safety mechanisms remain insufficient for realistic multimodal risks. Our findings reveal a critical blind spot in existing safety evaluations and establish context-aware multimodal auditing as an essential paradigm for diagnosing hidden vulnerabilities in modern VLM systems.
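The abstract only summarizes the pipeline, so the following is a minimal sketch of one plausible reading: greedy prompt search to seed candidates, followed by a REINFORCE-style update that steers a diffusion generator's conditioning toward images that, paired with partial toxic text, draw unsafe responses from a black-box VLM. Every component name below (embed, generate_image, query_vlm, toxicity_score, CandidatePolicy) is a hypothetical placeholder, not the authors' actual API, and the score-function gradient with a moving-average baseline is a standard choice assumed here, not necessarily the paper's estimator.

```python
import torch
import torch.nn as nn

def embed(prompt: str) -> torch.Tensor:
    # Placeholder text encoder: maps a prompt string to a conditioning vector.
    g = torch.Generator().manual_seed(abs(hash(prompt)) % (2**31))
    return torch.randn(64, generator=g)

def generate_image(conditioning: torch.Tensor) -> torch.Tensor:
    # Placeholder for a diffusion model decoding `conditioning` into an image.
    return conditioning  # stands in for a decoded image tensor

def query_vlm(image: torch.Tensor, partial_toxic_text: str) -> str:
    # Placeholder for a black-box VLM call (e.g., an API request).
    return "model response"

def toxicity_score(response: str) -> float:
    # Placeholder reward; a real audit would use a safety classifier.
    return torch.rand(1).item()

def greedy_seed(prompt_pool, text, k=4):
    # Greedy prompt search: keep the k prompts whose generated images
    # draw the highest unsafe-response reward from the target VLM.
    reward = lambda p: toxicity_score(query_vlm(generate_image(embed(p)), text))
    return sorted(prompt_pool, key=reward, reverse=True)[:k]

class CandidatePolicy(nn.Module):
    # Gaussian policy over diffusion conditioning vectors.
    def __init__(self, init: torch.Tensor):
        super().__init__()
        self.mean = nn.Parameter(init.clone())
        self.log_std = nn.Parameter(torch.zeros_like(init))

    def sample(self):
        dist = torch.distributions.Normal(self.mean, self.log_std.exp())
        z = dist.sample()                 # non-differentiable draw
        return z, dist.log_prob(z).sum()  # differentiable log-probability

def reinforce_loop(policy, text, steps=100, lr=1e-3):
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    baseline = 0.0
    for _ in range(steps):
        z, logp = policy.sample()
        reward = toxicity_score(query_vlm(generate_image(z), text))
        baseline = 0.9 * baseline + 0.1 * reward  # variance-reducing baseline
        loss = -(reward - baseline) * logp        # REINFORCE objective
        opt.zero_grad()
        loss.backward()
        opt.step()

seeds = greedy_seed(["benign scene", "ambiguous scene"], "partial toxic text")
policy = CandidatePolicy(embed(seeds[0]))
reinforce_loop(policy, "partial toxic text")
```

In this reading, the greedy search supplies a strong initialization so that the reinforcement stage refines promising conditioning vectors rather than exploring from scratch, which matches the abstract's description of combining the two steps; the actual reward model, policy parameterization, and diffusion interface would need to follow the paper.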