Birds of a Feather Flock Together: Background-Invariant Representations via Linear Structure in VLMs
arXiv:2605.11107v1 Announce Type: new
Abstract: Vision-language models (VLMs), such as CLIP and SigLIP 2, are widely used for image classification, yet their vision encoders remain vulnerable to systematic biases that undermine robustness. In particul…